Preface


Purpose 


The purpose of this book is to give the reader a better understanding of 
how computers really work at a lower level than in programming languages 
like Pascal. By gaining a deeper understanding of how computers work, the 
reader can often be much more productive developing software in higher level 
languages such as C and C++. Learning to program in assembly language 
is an excellent way to achieve this goal. Other PC assembly language books 
still teach how to program the 8086 processor that the original PC used in 
1981! The 8086 processor only supported real mode. In this mode, any 
program may address any memory or device in the computer. This mode is 
not suitable for a secure, multitasking operating system. This book instead 
discusses how to program the 80386 and later processors in protected mode 
(the mode that Windows and Linux run in). This mode supports the
features that modern operating systems expect, such as virtual memory and 
memory protection. There are several reasons to use protected mode: 


1. It is easier to program in protected mode than in the 8086 real mode 
that other books use. 


2. All modern PC operating systems run in protected mode. 
3. There is free software available that runs in this mode. 


The lack of textbooks for protected mode PC assembly programming is the 
main reason that the author wrote this book. 

As alluded to above, this text makes use of Free/Open Source software: 
namely, the NASM assembler and the DJGPP C/C++ compiler. Both 
of these are available to download from the Internet. The text also discusses
how to use NASM assembly code under the Linux operating system
and with Borland’s and Microsoft’s C/C++ compilers under Windows.
Examples for all of these platforms can be found on my web site:
http://pacman128.github.io/pcasm/. You must download the example
code if you wish to assemble and run many of the examples in this tutorial.
Be aware that this text does not attempt to cover every aspect of assembly
programming. The author has tried to cover the most important topics
that all programmers should be acquainted with. 
Chapter 1


Introduction 


1.1 Number Systems 


Memory in a computer consists of numbers. Computer memory does 
not store these numbers in decimal (base 10). Because it greatly simplifies 
the hardware, computers store all information in a binary (base 2) format. 
First let’s review the decimal system. 


1.1.1 Decimal 


Base 10 numbers are composed of 10 possible digits (0-9). Each digit of 
a number has a power of 10 associated with it based on its position in the 
number. For example: 


$234 = 2 \times 10^2 + 3 \times 10^1 + 4 \times 10^0$


1.1.2 Binary 


Base 2 numbers are composed of 2 possible digits (0 and 1). Each digit 
of a number has a power of 2 associated with it based on its position in the 
number. (A single binary digit is called a bit.) For example¹:


$11001_2 = 1 \times 2^4 + 1 \times 2^3 + 0 \times 2^2 + 0 \times 2^1 + 1 \times 2^0$
$= 16 + 8 + 1$
$= 25$


This shows how binary may be converted to decimal. Table 1.1 shows 
how the first few numbers are represented in binary. 
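
The positional expansion above can be checked mechanically. The following Python sketch sums each bit times its power of 2; the function name binary_to_decimal is an illustrative choice, not something defined in this text.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum each binary digit times the power of 2 for its position."""
    total = 0
    # reversed() lets position 0 be the rightmost digit
    for position, digit in enumerate(reversed(bits)):
        total += int(digit) * 2 ** position
    return total

print(binary_to_decimal("11001"))  # 16 + 8 + 1 = 25
```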

Figure 1.1 shows how individual binary digits (i.e., bits) are added.
Here’s an example: 


¹The 2 subscript is used to show that the number is represented in binary, not decimal


Decimal | Binary | Decimal | Binary
0       | 0000   | 8       | 1000
1       | 0001   | 9       | 1001
2       | 0010   | 10      | 1010
3       | 0011   | 11      | 1011
4       | 0100   | 12      | 1100
5       | 0101   | 13      | 1101
6       | 0110   | 14      | 1110
7       | 0111   | 15      | 1111


Table 1.1: Decimal 0 to 15 in Binary 


      No previous carry              Previous carry
  0     0     1     1          0     0     1     1
 +0    +1    +0    +1         +0    +1    +0    +1
 --    --    --    --         --    --    --    --
  0     1     1     0          1     0     0     1
                    c                c     c     c


Figure 1.1: Binary addition (c stands for carry) 


   11011_2
 + 10001_2
 ---------
  101100_2

If one considers the following decimal division: 


1234 ÷ 10 = 123 r 4


one can see that this division strips off the rightmost decimal digit of the
number and shifts the other decimal digits one position to the right. Dividing 
by two performs a similar operation, but for the binary digits of the number. 
Consider the following binary division: 


$1101_2 \div 10_2 = 110_2$ r 1


This fact can be used to convert a decimal number to its equivalent binary 
representation as Figure 1.2 shows. This method finds the rightmost digit
first; this digit is called the least significant bit (lsb). The leftmost digit is
called the most significant bit (msb). The basic unit of memory consists of 
8 bits and is called a byte. 
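
The repeated-division method of Figure 1.2 translates directly into a loop. The Python sketch below is only an illustration (the name decimal_to_binary is hypothetical); each remainder is the next bit, found from the lsb upward.

```python
def decimal_to_binary(n: int) -> str:
    """Convert a non-negative integer to binary by repeated division by 2."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 2)   # remainder is the next bit, lsb first
        digits.append(str(remainder))
    return "".join(reversed(digits))  # the msb belongs on the left

print(decimal_to_binary(25))  # 11001
```

Dividing by 16 instead of 2, and mapping remainders 10 to 15 onto A to F, gives the corresponding decimal-to-hex conversion.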


1.1.3 Hexadecimal 


Hexadecimal numbers use base 16. Hexadecimal (or hex for short) can 
be used as a shorthand for binary numbers. Hex has 16 possible digits. This 




Decimal             Binary
25 ÷ 2 = 12 r 1     11001 ÷ 10 = 1100 r 1
12 ÷ 2 = 6 r 0      1100 ÷ 10 = 110 r 0
6 ÷ 2 = 3 r 0       110 ÷ 10 = 11 r 0
3 ÷ 2 = 1 r 1       11 ÷ 10 = 1 r 1
1 ÷ 2 = 0 r 1       1 ÷ 10 = 0 r 1


Thus $25_{10} = 11001_2$


Figure 1.2: Decimal conversion 


589 ÷ 16 = 36 r 13
36 ÷ 16 = 2 r 4
2 ÷ 16 = 0 r 2


Thus $589 = 24D_{16}$


Figure 1.3: 


creates a problem since there are no symbols to use for these extra digits 
after 9. By convention, letters are used for these extra digits. The 16 hex 
digits are 0-9 then A, B, C, D, E and F. The digit A is equivalent to 10 
in decimal, B is 11, etc. Each digit of a hex number has a power of 16 
associated with it. Example: 


$2BD_{16} = 2 \times 16^2 + 11 \times 16^1 + 13 \times 16^0$
$= 512 + 176 + 13$
$= 701$


To convert from decimal to hex, use the same idea that was used for binary 
conversion except divide by 16. See Figure 1.3 for an example. 

The reason that hex is useful is that there is a very simple way to convert 
between hex and binary. Binary numbers get large and cumbersome quickly. 
Hex provides a much more compact way to represent binary. 

To convert a hex number to binary, simply convert each hex digit to a 
4-bit binary number. For example, $24D_{16}$ is converted to $0010\ 0100\ 1101_2$.
word        | 2 bytes
double word | 4 bytes
quad word   | 8 bytes
paragraph   | 16 bytes


Table 1.2: Units of Memory 


Note that the leading zeros of the 4-bit groups are important! If the leading zero
for the middle digit of $24D_{16}$ is not used, the result is wrong. Converting
from binary to hex is just as easy. One does the process in reverse. Convert
each 4-bit segment of the binary to hex. Start from the right end, not the
left end of the binary number. This ensures that the process uses the correct
4-bit segments². Example:


 110   0000   0101   1010   0111   1110_2
   6      0      5      A      7      E_16


A 4-bit number is called a nibble. Thus each hex digit corresponds to
a nibble. Two nibbles make a byte and so a byte can be represented by a 
2-digit hex number. A byte’s value ranges from 0 to 11111111 in binary, 0 
to FF in hex and 0 to 255 in decimal. 
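
As a rough illustration of the nibble-by-nibble conversion, the Python sketch below expands each hex digit into four bits and, going the other way, groups bits into nibbles starting from the right end. The function names are made up for this example.

```python
def hex_to_binary(hex_digits: str) -> str:
    """Expand each hex digit into its 4-bit group, keeping leading zeros."""
    return " ".join(format(int(d, 16), "04b") for d in hex_digits)

def binary_to_hex(bits: str) -> str:
    """Group bits into nibbles starting from the right end, then map to hex."""
    bits = bits.replace(" ", "")
    padding = (-len(bits)) % 4      # zeros needed to complete the leftmost nibble
    bits = "0" * padding + bits
    nibbles = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(format(int(n, 2), "X") for n in nibbles)

print(hex_to_binary("24D"))                           # 0010 0100 1101
print(binary_to_hex("110 0000 0101 1010 0111 1110"))  # 605A7E
```

Padding on the left is what "start from the right end" amounts to: the incomplete group, if any, is always the leftmost one.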


1.2 Computer Organization 


1.2.1 Memory 


The basic unit of memory is a byte. A computer with 32 megabytes
of memory can hold roughly 32 million bytes of information. Each byte in
memory is labeled by a unique number known as its address as Figure 1.4
shows.

Memory is measured in units of kilobytes ($2^{10}$ = 1,024 bytes),
megabytes ($2^{20}$ = 1,048,576 bytes) and gigabytes ($2^{30}$ =
1,073,741,824 bytes).


Address | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7
Memory  | 2A | 45 | B8 | 20 | 8F | CD | 12 | 2E


Figure 1.4: Memory Addresses 


Often memory is used in larger chunks than single bytes. On the PC 
architecture, names have been given to these larger sections of memory as 
Table 1.2 shows. 


²If it is not clear why the starting point makes a difference, try converting the example
starting at the left.
All data in memory is numeric. Characters are stored by using a character
code that maps numbers to characters. One of the most common
character codes is known as ASCII (American Standard Code for Informa- 
tion Interchange). A new, more complete code that is supplanting ASCII 
is Unicode. One key difference between the two codes is that ASCII uses 
one byte to encode a character, but Unicode uses multiple bytes. There 
are several different forms of Unicode. On x86 C/C++ compilers, Unicode 
is represented in code using the wchar_t type and the UTF-16 encoding 
which uses 16 bits (or a word) per character. For example, ASCII maps the
byte $41_{16}$ ($65_{10}$) to the character capital A; UTF-16 maps the same
character to the word $0041_{16}$. Since ASCII uses a byte, it is limited to only
256 different characters³. Unicode extends the ASCII values and allows many more characters
to be represented. This is important for representing characters for all the 
languages of the world. 
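
The capital A example can be reproduced with Python's built-in codecs. This is only a sketch: the big-endian codec is chosen here so that the printed bytes read as the word 0041, which says nothing about how a particular machine orders those bytes in memory.

```python
text = "A"
ascii_bytes = text.encode("ascii")      # one byte per character
utf16_bytes = text.encode("utf-16-be")  # one 16-bit word, no byte-order mark

print(ord(text))          # 65
print(ascii_bytes.hex())  # 41
print(utf16_bytes.hex())  # 0041
```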


1.2.2 The CPU 


The Central Processing Unit (CPU) is the physical device that performs 
instructions. The instructions that CPUs perform are generally very simple. 
Instructions may require the data they act on to be in special storage loca- 
tions in the CPU itself called registers. The CPU can access data in registers 
much faster than data in memory. However, the number of registers in a 
CPU is limited, so the programmer must take care to keep only currently 
used data in registers. 

The instructions a type of CPU executes make up the CPU’s machine 
language. Machine programs have a much more basic structure than higher- 
level languages. Machine language instructions are encoded as raw numbers, 
not in friendly text formats. A CPU must be able to decode an instruction’s 
purpose very quickly to run efficiently. Machine language is designed with 
this goal in mind, not to be easily deciphered by humans. Programs written 
in other languages must be converted to the native machine language of 
the CPU to run on the computer. A compiler is a program that translates 
programs written in a programming language into the machine language of 
a particular computer architecture. In general, every type of CPU has its 
own unique machine language. This is one reason why programs written for 
a Mac can not run on an IBM-type PC. 

Computers use a clock to synchronize the execution of the instructions. 
The clock pulses at a fixed frequency (known as the clock speed). When you 
buy a 1.5 GHz computer, 1.5 GHz is the frequency of this clock⁴. The clock
does not keep track of minutes and seconds. It simply beats at a constant 


³In fact, ASCII only uses the lower 7 bits and so only has 128 different values to use.
⁴Actually, clock pulses are used in many different components of a computer. The
other components often use different clock speeds than the CPU.


GHz stands for gigahertz 
or one billion cycles per 
second. A 1.5 GHz CPU 
has 1.5 billion clock pulses 
per second. 
rate. The electronics of the CPU use the beats to perform their operations
correctly, much as the beats of a metronome help one play music at the
correct rhythm. The number of beats (or, as they are usually called, cycles)
an instruction requires depends on the CPU generation and model. It also
depends on the instructions before it and on other factors.


1.2.3 The 80x86 family of CPUs 


IBM-type PCs contain a CPU from Intel’s 80x86 family (or a clone of
one). The CPUs in this family all have some common features including a
base machine language. However, the more recent members greatly enhance 
the features. 


8088, 8086: From the programming standpoint, these CPUs are identical.
They were the CPUs used in the earliest PCs. They provide several
16-bit registers: AX, BX, CX, DX, SI, DI, BP, SP, CS, DS, SS, ES, IP, 
FLAGS. They only support up to one megabyte of memory and only 
operate in real mode. In this mode, a program may access any memory 
address, even the memory of other programs! This makes debugging 
and security very difficult! Also, program memory has to be divided 
into segments. Each segment can not be larger than 64K. 


80286: This CPU was used in AT class PCs. It adds some new instructions
to the base machine language of the 8088/86. However, its main new 
feature is 16-bit protected mode. In this mode, it can access up to 16 
megabytes and protect programs from accessing each other’s memory. 
However, programs are still divided into segments that could not be 
bigger than 64K. 


80386: This CPU greatly enhanced the 80286. First, it extends many of 
the registers to hold 32-bits (EAX, EBX, ECX, EDX, ESI, EDI, EBP, 
ESP, EIP) and adds two new 16-bit registers FS and GS. It also adds 
a new 32-bit protected mode. In this mode, it can access up to 4 
gigabytes. Programs are again divided into segments, but now each 
segment can also be up to 4 gigabytes in size! 


80486/Pentium/Pentium Pro: These members of the 80x86 family add 
very few new features. They mainly speed up the execution of the 
instructions. 


Pentium MMX: This processor adds the MMX (MultiMedia eXtensions)
instructions to the Pentium. These instructions can speed up common 
graphics operations. 
|      AX       |
|  AH   |  AL   |


Figure 1.5: The AX register 


Pentium II: This is the Pentium Pro processor with the MMX instructions
added. (The Pentium III is essentially just a faster Pentium II.) 


1.2.4 8086 16-bit Registers 


The original 8086 CPU provided four 16-bit general purpose registers: 
AX, BX, CX and DX. Each of these registers could be decomposed into 
two 8-bit registers. For example, the AX register could be decomposed into 
the AH and AL registers as Figure 1.5 shows. The AH register contains 
the upper (or high) 8 bits of AX and AL contains the lower 8 bits of AX. 
Often AH and AL are used as independent one byte registers; however, it is 
important to realize that they are not independent of AX. Changing AX’s 
value will change AH and AL and vice versa. The general purpose registers 
are used in many of the data movement and arithmetic instructions. 
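
The relationship between AX, AH and AL can be modeled with shifts and masks. The Python sketch below is a toy model for illustration only; the helper names split_ax, set_al and set_ah are invented here and do not correspond to any instruction.

```python
def split_ax(ax: int) -> tuple:
    """Return (AH, AL), the high and low bytes of a 16-bit value."""
    return (ax >> 8) & 0xFF, ax & 0xFF

def set_al(ax: int, al: int) -> int:
    """Return the new AX after writing a byte into AL; AH is untouched."""
    return (ax & 0xFF00) | (al & 0xFF)

def set_ah(ax: int, ah: int) -> int:
    """Return the new AX after writing a byte into AH; AL is untouched."""
    return ((ah & 0xFF) << 8) | (ax & 0x00FF)

ax = 0x12A4
print([hex(b) for b in split_ax(ax)])  # ['0x12', '0xa4']
ax = set_al(ax, 0xFF)                  # changing AL changes AX as well
print(hex(ax))                         # 0x12ff
```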

There are two 16-bit index registers: SI and DI. They are often used 
as pointers, but can be used for many of the same purposes as the general 
registers. However, they can not be decomposed into 8-bit registers. 

The 16-bit BP and SP registers are used to point to data in the ma- 
chine language stack and are called the Base Pointer and Stack Pointer, 
respectively. These will be discussed later. 

The 16-bit CS, DS, SS and ES registers are segment registers. They 
denote what memory is used for different parts of a program. CS stands 
for Code Segment, DS for Data Segment, SS for Stack Segment and ES for 
Extra Segment. ES is used as a temporary segment register. The details of 
these registers are in Sections 1.2.6 and 1.2.7. 

The Instruction Pointer (IP) register is used with the CS register to 
keep track of the address of the next instruction to be executed by the 
CPU. Normally, as an instruction is executed, IP is advanced to point to 
the next instruction in memory. 

The FLAGS register stores important information about the results of 
a previous instruction. These results are stored as individual bits in the 
register. For example, the Z bit is 1 if the result of the previous instruction 
was zero or 0 if not zero. Not all instructions modify the bits in FLAGS;
consult the table in the appendix to see how individual instructions affect 
the FLAGS register. 
So where did the infamous DOS 640K limit
come from? The BIOS
required some of the 1M
for its code and for hardware
devices like the video
screen.
1.2.5 80386 32-bit registers


The 80386 and later processors have extended registers. For example,
the 16-bit AX register is extended to be 32 bits. To be backward compatible,
AX still refers to the 16-bit register and EAX is used to refer to the extended
32-bit register. AX is the lower 16 bits of EAX just as AL is the lower 8
bits of AX (and EAX). There is no way to access the upper 16 bits of EAX
directly. The other extended registers are EBX, ECX, EDX, ESI and EDI.

Many of the other registers are extended as well. BP becomes EBP; SP 
becomes ESP; FLAGS becomes EFLAGS and IP becomes EIP. However, 
unlike the index and general purpose registers, in 32-bit protected mode 
(discussed below) only the extended versions of these registers are used. 

The segment registers are still 16-bit in the 80386. There are also two 
new segment registers: FS and GS. Their names do not stand for anything. 
They are extra temporary segment registers (like ES). 

One of the definitions of the term word refers to the size of the data registers
of the CPU. For the 80x86 family, the term is now a little confusing. In 
Table 1.2, one sees that word is defined to be 2 bytes (or 16 bits). It was 
given this meaning when the 8086 was first released. When the 80386 was 
developed, it was decided to leave the definition of word unchanged, even 
though the register size changed. 


1.2.6 Real Mode 


In real mode, memory is limited to only one megabyte ($2^{20}$ bytes). Valid
addresses range from (in hex) 00000 to FFFFF. These addresses require a 20-bit
number. Obviously, a 20-bit number will not fit into any of the 8086’s
16-bit registers. Intel solved this problem by using two 16-bit values to
determine an address. The first 16-bit value is called the selector. Selector 
values must be stored in segment registers. The second 16-bit value is called 
the offset. The physical address referenced by a 32-bit selector:offset pair is 
computed by the formula 


16 x selector + offset 


Multiplying by 16 in hex is easy: just append a 0 to the right of the number.
For example, the physical address referenced by 047C:0048 is given by:


   047C0
 +  0048
   -----
   04808


In effect, the selector value is a paragraph number (see Table 1.2). 
Real segmented addresses have disadvantages: 
• A single selector value can only reference 64K of memory (the upper
limit of the 16-bit offset). What if a program has more than 64K of 
code? A single value in CS can not be used for the entire execution 
of the program. The program must be split up into sections (called 
segments) less than 64K in size. When execution moves from one seg- 
ment to another, the value of CS must be changed. Similar problems 
occur with large amounts of data and the DS register. This can be 
very awkward! 


• Each byte in memory does not have a unique segmented address. The
physical address 04808 can be referenced by 047C:0048, 047D:0038, 
047E:0028 or 047B:0058. This can complicate the comparison of seg- 
mented addresses. 
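
The real-mode formula, and the aliasing problem just described, can be checked with a few lines of Python. This is a minimal sketch; the function name physical_address is illustrative, and no wrap-around at the top of the one megabyte range is modeled.

```python
def physical_address(selector: int, offset: int) -> int:
    """Real-mode address: 16 * selector + offset."""
    return (selector << 4) + offset   # shifting left 4 bits multiplies by 16

pairs = [(0x047C, 0x0048), (0x047D, 0x0038), (0x047E, 0x0028), (0x047B, 0x0058)]
for selector, offset in pairs:
    # every pair prints the same physical address, 04808
    print(f"{selector:04X}:{offset:04X} -> {physical_address(selector, offset):05X}")
```

Because four different selector:offset pairs name the same byte, comparing two segmented addresses for equality requires converting both to physical addresses first.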


1.2.7 16-bit Protected Mode 


In the 80286’s 16-bit protected mode, selector values are interpreted 
completely differently than in real mode. In real mode, a selector value 
is a paragraph number of physical memory. In protected mode, a selector 
value is an index into a descriptor table. In both modes, programs are 
divided into segments. In real mode, these segments are at fixed positions 
in physical memory and the selector value denotes the paragraph number 
of the beginning of the segment. In protected mode, the segments are not 
at fixed positions in physical memory. In fact, they do not have to be in 
memory at all! 

Protected mode uses a technique called virtual memory. The basic idea
of a virtual memory system is to only keep the data and code in memory that 
programs are currently using. Other data and code are stored temporarily 
on disk until they are needed again. In 16-bit protected mode, segments are 
moved between memory and disk as needed. When a segment is returned 
to memory from disk, it is very likely that it will be put into a different area 
of memory than it was in before being moved to disk. All of this is done
transparently by the operating system. The program does not have to be 
written differently for virtual memory to work. 

In protected mode, each segment is assigned an entry in a descriptor 
table. This entry has all the information that the system needs to know 
about the segment. This information includes: is it currently in memory; 
if in memory, where is it; access permissions (e.g., read-only). The index 
of the entry of the segment is the selector value that is stored in segment 
registers. 
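
The descriptor table idea can be sketched as a lookup. The Python model below follows the simplified description in this section, where the selector is used directly as an index; the Descriptor fields and the translate function are hypothetical names, and real descriptors hold more information than is shown here.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    present: bool     # is the segment currently in memory?
    base: int         # where it starts in physical memory (if present)
    read_only: bool   # access permission

def translate(table, selector, offset, writing=False):
    """Look up the segment's descriptor and return a physical address."""
    entry = table[selector]
    if not entry.present:
        # a real system would have the OS bring the segment back from disk
        raise LookupError("segment not in memory")
    if writing and entry.read_only:
        raise PermissionError("write to a read-only segment")
    return entry.base + offset
```

Note that the program only ever supplies the selector and the offset. Where the segment actually sits in physical memory is recorded in the table, so the operating system can move a segment by updating its entry.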

One big disadvantage of 16-bit protected mode is that offsets are still
16-bit quantities. As a consequence of this, segment sizes are still limited to
at most 64K. This makes the use of large arrays problematic!

One well-known PC
columnist called the 286
CPU “brain dead.”
1.2.8 32-bit Protected Mode


The 80386 introduced 32-bit protected mode. There are two major dif- 
ferences between 386 32-bit and 286 16-bit protected modes: 


1. Offsets are expanded to be 32 bits. This allows an offset to range up
to 4 billion. Thus, segments can have sizes up to 4 gigabytes. 


2. Segments can be divided into smaller 4K-sized units called pages. The 
virtual memory system works with pages now instead of segments. 
This means that only parts of a segment may be in memory at any one
time. In 286 16-bit mode, either the entire segment is in memory or 
none of it is. This is not practical with the larger segments that 32-bit 
mode allows. 
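
To make the page idea concrete, the Python sketch below splits a 32-bit offset into a page number and a position within that 4K page. It only illustrates the arithmetic (the name split_offset is invented here) and does not model how the processor's page tables are organized.

```python
PAGE_SIZE = 4096  # 4K bytes per page

def split_offset(offset: int) -> tuple:
    """Return (page number, position within the page) for an offset."""
    return divmod(offset, PAGE_SIZE)

page, within = split_offset(0x12345)
print(hex(page), hex(within))   # 0x12 0x345
print(2 ** 32 // PAGE_SIZE)     # pages in a 4 gigabyte segment: 1048576
```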


In Windows 3.x, standard mode referred to 286 16-bit protected mode 
and enhanced mode referred to 32-bit mode. Windows 9X, Windows NT/2000/XP, 
OS/2 and Linux all run in paged 32-bit protected mode. 


1.2.9 Interrupts 


Sometimes the ordinary flow of a program must be interrupted to process 
events that require prompt response. The hardware of a computer provides 
a mechanism called interrupts to handle these events. For example, when 
a mouse is moved, the mouse hardware interrupts the current program to 
handle the mouse movement (to move the mouse cursor, etc.). Interrupts
cause control to be passed to an interrupt handler. Interrupt handlers are 
routines that process the interrupt. Each type of interrupt is assigned an 
integer number. At the beginning of physical memory resides a table of
interrupt vectors that contains the segmented addresses of the interrupt
handlers. The number of the interrupt is essentially an index into this table.
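
Indexing the vector table can be sketched as follows. The Python code assumes the real-mode layout of the 8086 family, where each entry occupies 4 bytes (a 16-bit offset followed by a 16-bit segment); that layout is not spelled out in this section, and the handler address stored below is a made-up value for the demonstration.

```python
import struct

def read_vector(memory, number: int) -> tuple:
    """Return (segment, offset) of the handler for an interrupt number."""
    # entry n starts at byte 4 * n: offset word first, then segment word
    offset, segment = struct.unpack_from("<HH", memory, 4 * number)
    return segment, offset

memory = bytearray(1024)                               # room for 256 entries
struct.pack_into("<HH", memory, 4 * 0x21, 0x1234, 0x0F00)  # hypothetical handler
segment, offset = read_vector(memory, 0x21)
print(f"{segment:04X}:{offset:04X}")                   # 0F00:1234
```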

External interrupts are raised from outside the CPU. (The mouse is an 
example of this type.) Many I/O devices raise interrupts (e.g., keyboard, 
timer, disk drives, CD-ROM and sound cards). Internal interrupts are raised 
from within the CPU, either from an error or the interrupt instruction. Error 
interrupts are also called traps. Interrupts generated from the interrupt 
instruction are called software interrupts. DOS uses these types of interrupts 
to implement its API (Application Programming Interface). More modern 
operating systems (such as Windows and UNIX) use a C-based interface.⁵

Many interrupt handlers return control back to the interrupted program 
when they finish. They restore all the registers to the same values they 
had before the interrupt occurred. Thus, the interrupted program runs as 
if nothing happened (except that it lost some CPU cycles). Traps generally 
do not return. Often they abort the program. 


⁵However, they may use a lower level interface at the kernel level.
1.3 Assembly Language


1.3.1 Machine language 


Every type of CPU understands its own machine language. Instructions 
in machine language are numbers stored as bytes in memory. Each instruc- 
tion has its own unique numeric code called its operation code or opcode 
for short. The 80x86 processor’s instructions vary in size. The opcode is 
always at the beginning of the instruction. Many instructions also include 
data (e.g., constants or addresses) used by the instruction. 

Machine language is very difficult to program in directly. Deciphering 
the meanings of the numerical-coded instructions is tedious for humans. 
For example, the instruction that says to add the EAX and EBX registers 
together and store the result back into EAX is encoded by the following hex 
codes: 


03 C3 


This is hardly obvious. Fortunately, a program called an assembler can do 
this tedious work for the programmer. 
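
To show what deciphering involves, here is a small Python sketch that decodes just the two bytes above. It relies on details from Intel's instruction format that this text does not cover: opcode 03 is one form of ADD, and the second byte packs a 2-bit mode field and two 3-bit register numbers. The function name is illustrative, and only the register-to-register form is handled.

```python
# 32-bit register numbers as used in the second instruction byte
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_add_reg_reg(code: bytes) -> str:
    """Decode a two-byte 'ADD reg, reg' instruction in 32-bit mode."""
    opcode, second = code[0], code[1]
    if opcode != 0x03:
        raise ValueError("not the ADD form handled by this sketch")
    mode = second >> 6               # top two bits: 11 means both operands are registers
    dest = (second >> 3) & 0b111     # middle three bits: destination register
    src = second & 0b111             # low three bits: source register
    if mode != 0b11:
        raise ValueError("memory operands are not handled by this sketch")
    return f"add {REG32[dest]}, {REG32[src]}"

print(decode_add_reg_reg(bytes([0x03, 0xC3])))  # add eax, ebx
```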


1.3.2 Assembly language 


An assembly language program is stored as text (just as a higher level 
language program). Each assembly instruction represents exactly one ma- 
chine instruction. For example, the addition instruction described above 
would be represented in assembly language as: 


add eax, ebx 


Here the meaning of the instruction is much clearer than in machine code. 
The word add is a mnemonic for the addition instruction. The general form 
of an assembly instruction is: 


mnemonic operand(s) 


An assembler is a program that reads a text file with assembly instruc- 
tions and converts the assembly into machine code. Compilers are programs 
that do similar conversions for high-level programming languages. An assem- 
bler is much simpler than a compiler. Every assembly language statement 
directly represents a single machine instruction. High-level language state- 
ments are much more complex and may require many machine instructions. 

Another important difference between assembly and high-level languages 
is that since every different type of CPU has its own machine language, it 
also has its own assembly language. Porting assembly programs between 


It took several years for 
computer scientists to fig- 
ure out how to even write 


a compiler! 
different computer architectures is much more difficult than in a high-level
language. 

This book’s examples use the Netwide Assembler, or NASM for short. It
is freely available on the Internet (see the preface for the URL). More common
assemblers are Microsoft’s Assembler (MASM) and Borland’s Assembler
(TASM). There are some differences in the assembly syntax for MASM/TASM
and NASM.


1.3.3 Instruction operands 


Machine code instructions have varying numbers and types of operands;
however, in general, each instruction itself will have a fixed number of oper- 
ands (0 to 3). Operands can have the following types: 


register: These operands refer directly to the contents of the CPU’s regis- 
ters. 


memory: These refer to data in memory. The address of the data may be 
a constant hardcoded into the instruction or may be computed using 
values of registers. Addresses are always offsets from the beginning of a
segment. 


immediate: These are fixed values that are listed in the instruction itself. 
They are stored in the instruction itself (in the code segment), not in 
the data segment. 


implied: These operands are not explicitly shown. For example, the in- 
crement instruction adds one to a register or memory. The one is 
implied. 


1.3.4 Basic instructions 


The most basic instruction is the MOV instruction. It moves data from one 
location to another (like the assignment operator in a high-level language). 
It takes two operands: 


mov dest, src 


The data specified by src is copied to dest. One restriction is that both 
operands may not be memory operands. This points out another quirk of 
assembly. There are often somewhat arbitrary rules about how the various 
instructions are used. The operands must also be the same size. The value 
of AX can not be stored into BL. 

Here is an example (semicolons start a comment): 
      mov    eax, 3     ; store 3 into EAX register (3 is immediate operand)
      mov    bx, ax     ; store the value of AX into the BX register


The ADD instruction is used to add integers. 


add eax, 4 ; eax = eax + 4 
add al, ah ; al = al + ah 


The SUB instruction subtracts integers. 


sub bx, 10 ; bx = bx - 10 
sub ebx, edi ; ebx = ebx - edi 


The INC and DEC instructions increment or decrement values by one. 
Since the one is an implicit operand, the machine code for INC and DEC is 
smaller than for the equivalent ADD and SUB instructions. 


inc ecx ; ecx++
dec dl ; dl-- 
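
One detail worth keeping in mind with these instructions is that the result always has the same size as the operands, so anything that does not fit is cut off. The Python sketch below models only that truncation, under the assumption that flags such as carry are ignored; the helper names add and sub are chosen to mirror the mnemonics and are not part of any library.

```python
def add(dest: int, src: int, bits: int) -> int:
    """Model ADD on operands of the given width; the result is truncated."""
    return (dest + src) & ((1 << bits) - 1)

def sub(dest: int, src: int, bits: int) -> int:
    """Model SUB on operands of the given width; the result is truncated."""
    return (dest - src) & ((1 << bits) - 1)

print(hex(add(0xF0, 0x20, 8)))   # add al, ah with AL=F0h, AH=20h gives 0x10
print(hex(sub(5, 10, 16)))       # sub bx, 10 with BX=5 gives 0xfffb
print(hex(add(0xFF, 1, 8)))      # inc on an 8-bit register holding FFh gives 0x0
```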


1.3.5 Directives 


A directive is an artifact of the assembler not the CPU. They are gen- 
erally used to either instruct the assembler to do something or inform the 
assembler of something. They are not translated into machine code. Com- 
mon uses of directives are: 

• define constants
• define memory to store data into
• group memory into segments
• conditionally include source code
• include other files

NASM code passes through a preprocessor just like C. It has many of 


the same preprocessor commands as C. However, NASM’s preprocessor di- 
rectives start with a % instead of a # as in C. 


The equ directive 


The equ directive can be used to define a symbol. Symbols are named 
constants that can be used in the assembly program. The format is: 


symbol equ value 


Symbol values can not be redefined later. 
Unit        | Letter
byte        | B
word        | W
double word | D
quad word   | Q
ten bytes   | T


Table 1.3: Letters for RESX and DX Directives 


The %define directive 


This directive is similar to C’s #define directive. It is most commonly 
used to define constant macros just as in C. 


%define SIZE 100
      mov    eax, SIZE


The above code defines a macro named SIZE and shows its use in a MOV
instruction. Macros are more flexible than symbols in two ways. Macros
can be redefined and can be more than simple constant numbers.


Data directives


Data directives are used in data segments to define room for memory.
There are two ways memory can be reserved. The first way only defines
room for data; the second way defines room and an initial value. The first
method uses one of the RESX directives. The X is replaced with a letter that
determines the size of the object (or objects) that will be stored. Table 1.3
shows the possible values.

The second method (that defines an initial value, too) uses one of the
DX directives. The X letters are the same as those in the RESX directives.

It is very common to mark memory locations with labels. Labels allow
one to easily refer to memory locations in code. Below are several examples:


L1    db     0        ; byte labeled L1 with initial value 0
L2    dw     1000     ; word labeled L2 with initial value 1000
L3    db     110101b  ; byte initialized to binary 110101 (53 in decimal)
L4    db     12h      ; byte initialized to hex 12 (18 in decimal)
L5    db     17o      ; byte initialized to octal 17 (15 in decimal)
L6    dd     1A92h    ; double word initialized to hex 1A92
L7    resb   1        ; 1 uninitialized byte
L8    db     "A"      ; byte initialized to ASCII code for A (65)


Double quotes and single quotes are treated the same. Consecutive data 
definitions are stored sequentially in memory. That is, the word L2 is stored 
immediately after L1 in memory. Sequences of memory may also be defined. 


L9    db     0, 1, 2, 3              ; defines 4 bytes
L10   db     "w", "o", "r", 'd', 0   ; defines a C string = "word"
L11   db     'word', 0               ; same as L10


The DD directive can be used to define both integer and single precision
floating point⁶ constants. However, the DQ can only be used to define double
precision floating point constants.

For large sequences, NASM’s TIMES directive is often useful. This directive
repeats its operand a specified number of times. For example,


L12   times 100 db 0                 ; equivalent to 100 (db 0)’s
L13   resw   100                     ; reserves room for 100 words
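
Since consecutive definitions are stored one after another, the offset of each label follows from the unit sizes in Table 1.3. The Python sketch below works this out for the definitions L1 through L8; the tuple format and the layout function are assumptions made for illustration and are unrelated to how NASM itself is implemented.

```python
# bytes per unit, keyed by the last letter of the directive (Table 1.3)
UNIT_SIZE = {"b": 1, "w": 2, "d": 4, "q": 8, "t": 10}

# (label, directive, number of units) for the definitions L1 through L8
DEFINITIONS = [
    ("L1", "db", 1), ("L2", "dw", 1), ("L3", "db", 1), ("L4", "db", 1),
    ("L5", "db", 1), ("L6", "dd", 1), ("L7", "resb", 1), ("L8", "db", 1),
]

def layout(definitions):
    """Return ({label: offset}, total size) for sequential definitions."""
    offsets = {}
    position = 0
    for label, directive, count in definitions:
        offsets[label] = position                      # label marks where the data starts
        position += UNIT_SIZE[directive[-1]] * count   # next item follows immediately
    return offsets, position

offsets, total = layout(DEFINITIONS)
print(offsets["L2"], offsets["L6"], total)  # 1 6 12
```

The same function gives 200 bytes for a resw with 100 units, since each word is 2 bytes.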


Remember that labels can be used to refer to data in code. There are two 
ways that a label can be used. If a plain label is used, it is interpreted as the 
address (or offset) of the data. If the label is placed inside square brackets 
([]), it is interpreted as the data at the address. In other words, one should
think of a label as a pointer to the data and the square brackets dereference
the pointer just as the asterisk does in C. (MASM/TASM follow a different 
convention.) In 32-bit mode, addresses are 32-bit. Here are some examples: 


mov    al, [L1]     ; copy byte at L1 into AL
mov    eax, L1      ; EAX = address of byte at L1
mov    [L1], ah     ; copy AH into byte at L1
mov    eax, [L6]    ; copy double word at L6 into EAX
add    eax, [L6]    ; EAX = EAX + double word at L6
add    [L6], eax    ; double word at L6 += EAX
mov    al, [L6]     ; copy first byte of double word at L6 into AL


Line 7 of the examples shows an important property of NASM. The assembler
does not keep track of the type of data that a label refers to. It is up to
the programmer to make sure that he (or she) uses a label correctly. Later
it will be common to store addresses of data in registers and use the register
like a pointer variable in C. Again, no checking is made that a pointer is
used correctly. In this way, assembly is much more error prone than even C.
Consider the following instruction:


mov [L6], 1 ; store a 1 at L6 


This statement produces an operation size not specified error. Why? 
Because the assembler does not know whether to store the 1 as a byte, word 
or double word. To fix this, add a size specifier: 


mov dword [L6], 1 ; store a 1 at L6 


This tells the assembler to store a 1 at the double word that starts at L6.
Other size specifiers are: BYTE, WORD, QWORD and TWORD⁷.

⁶Single precision floating point is equivalent to a float variable in C.
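
Since the text compares labels to C pointers, the same ideas can be sketched in C. This is only an analogy under stated assumptions: the variable L6 and the four helper functions below are hypothetical names chosen for illustration, not part of the author's library.

```c
#include <stdint.h>

/* Hypothetical C counterpart of "L6 dd 1A92h": a 32-bit variable. */
static uint32_t L6 = 0x1A92;

/* "mov eax, L6" loads the address; in C that is the pointer &L6.
   "mov eax, [L6]" loads the contents; in C that is *(&L6). */
static uint32_t load_dword(const uint32_t *label) {
    return *label;
}

/* "mov al, [L6]" reads only the first byte stored at the label.
   NASM allows this silently; C asks for an explicit cast. */
static uint8_t load_first_byte(const void *label) {
    return *(const uint8_t *)label;
}

/* "mov dword [L6], 1": the size specifier does the job of the pointer
   type. Writing through a bare void * is rejected by a C compiler for
   the same reason NASM rejects "mov [L6], 1": the size is unknown. */
static void store_dword(void *label, uint32_t value) {
    *(uint32_t *)label = value;
}

/* "mov byte [L6], 1": only a single byte is overwritten. */
static void store_byte(void *label, uint8_t value) {
    *(uint8_t *)label = value;
}
```

Calling store_byte(&L6, 1) when L6 holds FFFFFFFF changes one byte and leaves the other three alone, while store_dword(&L6, 1) replaces all four. That difference is exactly what the size specifier selects in NASM.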


1.3.6 Input and Output 


Input and output are very system dependent activities. They involve
interfacing with the system’s hardware. High level languages, like C, provide
standard libraries of routines that provide a simple, uniform programming 
interface for I/O. Assembly languages provide no standard libraries. They 
must either directly access hardware (which is a privileged operation in pro- 
tected mode) or use whatever low level routines that the operating system 
provides. 

It is very common for assembly routines to be interfaced with C. One 
advantage of this is that the assembly code can use the standard C library 
I/O routines. However, one must know the rules for passing information 
between routines that C uses. These rules are too complicated to cover 
here. (They are covered later!) To simplify I/O, the author has developed
his own routines that hide the complex C rules and provide a much simpler
interface. Table 1.4 describes the routines provided. All of the routines
preserve the value of all registers, except for the read routines. These
routines do modify the value of the EAX register. To use these routines, one
must include a file with information that the assembler needs to use them.
To include a file in NASM, use the %include preprocessor directive. The
following line includes the file needed by the author’s I/O routines⁸:


%include "asm_io.inc"


To use one of the print routines, one loads EAX with the correct value 
and uses a CALL instruction to invoke it. The CALL instruction is equivalent 
to a function call in a high level language. It jumps execution to another 
section of code, but returns back to its origin after the routine is over. 
The example program below shows several examples of calls to these I/O 
routines. 


Routine        Description
print_int      prints out to the screen the value of the integer stored
               in EAX
print_char     prints out to the screen the character whose ASCII
               value is stored in AL
print_string   prints out to the screen the contents of the string at
               the address stored in EAX. The string must be a C-type
               string (i.e. null terminated).
print_nl       prints out to the screen a new line character.
read_int       reads an integer from the keyboard and stores it into
               the EAX register.
read_char      reads a single character from the keyboard and stores
               its ASCII code into the EAX register.

Table 1.4: Assembly I/O Routines

⁷TWORD defines a ten byte area of memory. The floating point coprocessor
uses this data type.

⁸The asm_io.inc (and the asm_io object file that asm_io.inc requires)
are in the example code downloads on the web page for this tutorial,
http://pacman128.github.io/pcasm/


1.3.7 Debugging


The author’s library also contains some useful routines for debugging
programs. These debugging routines display information about the state of
the computer without modifying the state. These routines are really macros
that preserve the current state of the CPU and then make a subroutine call.
The macros are defined in the asm_io.inc file discussed above. Macros
are used like ordinary instructions. Operands of macros are separated by
commas.

There are four debugging routines named dump_regs, dump_mem, dump_stack
and dump_math; they display the values of registers, memory, stack and the
math coprocessor, respectively.


dump_regs This macro prints out the values of the registers (in hexadecimal)
    of the computer to stdout (i.e. the screen). It also displays the
    bits set in the FLAGS⁹ register. For example, if the zero flag is 1, ZF
    is displayed. If it is 0, it is not displayed. It takes a single integer
    argument that is printed out as well. This can be used to distinguish
    the output of different dump_regs commands.

dump_mem This macro prints out the values of a region of memory (in
    hexadecimal) and also as ASCII characters. It takes three comma
    delimited arguments. The first is an integer that is used to label
    the output (just as the dump_regs argument does). The second argument
    is the address to display. (This can be a label.) The last argument is
    the number of 16-byte paragraphs to display after the address. The
    memory displayed will start on the first paragraph boundary before
    the requested address.

dump_stack This macro prints out the values on the CPU stack. (The
    stack will be covered in Chapter 4.) The stack is organized as double
    words and this routine displays them this way. It takes three comma
    delimited arguments. The first is an integer label (like dump_regs).
    The second is the number of double words to display below the address
    that the EBP register holds and the third argument is the number of
    double words to display above the address in EBP.

dump_math This macro prints out the values of the registers of the math
    coprocessor. It takes a single integer argument that is used to label
    the output just as the argument of dump_regs does.

⁹Chapter 2 discusses this register.


1.4 Creating a Program 


Today, it is unusual to create a stand alone program written completely 
in assembly language. Assembly is usually used only for certain critical
routines. Why? It is much easier to program in a higher level language than in
assembly. Also, using assembly makes a program very hard to port to other 
platforms. In fact, it is rare to use assembly at all. 

So, why should anyone learn assembly at all? 


1. Sometimes code written in assembly can be faster and smaller than 
compiler generated code. 


2. Assembly allows direct access to hardware features of the system that
might be difficult or impossible to use from a higher level language.


3. Learning to program in assembly helps one gain a deeper understand- 
ing of how computers work. 


4. Learning to program in assembly helps one understand better how 
compilers and high level languages like C work. 


These last two points demonstrate that learning assembly can be useful 
even if one never programs in it later. In fact, the author rarely programs 
in assembly, but he uses the ideas he learned from it every day.


1.4.1 First program 


The early programs in this text will all start from the simple C driver
program in Figure 1.6. It simply calls another function named asm_main.
This is really a routine that will be written in assembly.


int asm_main(void);   /* the routine written in assembly */

int main()
{
  int ret_status;
  ret_status = asm_main();
  return ret_status;
}


Figure 1.6: driver.c code


There are several advantages in using the C driver routine. First, this
lets the C system set up the program to run correctly in protected mode.
All the segments and their corresponding segment registers will be
initialized by C. The assembly code need not worry about any of this.
Secondly, the C library will also be available to be used by the assembly
code. The author’s I/O routines take advantage of this. They use C’s I/O
functions (printf, etc.). The following shows a simple assembly program.


first.asm

 1  ; file: first.asm
 2  ; First assembly program. This program asks for two integers as
 3  ; input and prints out their sum.
 4  ;
 5  ; To create executable using djgpp:
 6  ; nasm -f coff first.asm
 7  ; gcc -o first first.o driver.c asm_io.o
 8
 9  %include "asm_io.inc"
10  ;
11  ; initialized data is put in the .data segment
12  ;
13  segment .data
14  ;
15  ; These labels refer to strings used for output
16  ;
17  prompt1 db    "Enter a number: ", 0       ; don't forget null terminator
18  prompt2 db    "Enter another number: ", 0
19  outmsg1 db    "You entered ", 0
20  outmsg2 db    " and ", 0
21  outmsg3 db    ", the sum of these is ", 0
22
23  ;
24  ; uninitialized data is put in the .bss segment
25  ;
26  segment .bss
27  ;
28  ; These labels refer to double words used to store the inputs
29  ;
30  input1  resd 1
31  input2  resd 1
32
33  ;
34  ; code is put in the .text segment
35  ;
36  segment .text
37          global  _asm_main
38  _asm_main:
39          enter   0,0               ; setup routine
40          pusha
41
42          mov     eax, prompt1      ; print out prompt
43          call    print_string
44
45          call    read_int          ; read integer
46          mov     [input1], eax     ; store into input1
47
48          mov     eax, prompt2      ; print out prompt
49          call    print_string
50
51          call    read_int          ; read integer
52          mov     [input2], eax     ; store into input2
53
54          mov     eax, [input1]     ; eax = dword at input1
55          add     eax, [input2]     ; eax += dword at input2
56          mov     ebx, eax          ; ebx = eax
57
58          dump_regs 1               ; print out register values
59          dump_mem 2, outmsg1, 1    ; print out memory
60  ;
61  ; next print out result message as series of steps
62  ;
63          mov     eax, outmsg1
64          call    print_string      ; print out first message
65          mov     eax, [input1]
66          call    print_int         ; print out input1
67          mov     eax, outmsg2
68          call    print_string      ; print out second message
69          mov     eax, [input2]
70          call    print_int         ; print out input2
71          mov     eax, outmsg3
72          call    print_string      ; print out third message
73          mov     eax, ebx
74          call    print_int         ; print out sum (ebx)
75          call    print_nl          ; print new-line
76
77          popa
78          mov     eax, 0            ; return back to C
79          leave
80          ret

first.asm


Line 13 of the program defines a section of the program that specifies 
memory to be stored in the data segment (whose name is .data). Only 
initialized data should be defined in this segment. On lines 17 to 21, several 
strings are declared. They will be printed with the C library and so must 
be terminated with a null character (ASCII code 0). Remember there is a 
big difference between 0 and ’0’. 


Uninitialized data should be declared in the bss segment (named .bss
on line 26). This segment gets its name from an early UNIX-based assembler
operator that meant “block started by symbol.” There is also a stack
segment. It will be discussed later.


The code segment is named .text historically. It is where instructions
are placed. Note that the code label for the main routine (line 38) has an
underscore prefix. This is part of the C calling convention. This convention
specifies the rules C uses when compiling code. It is very important
to know this convention when interfacing C and assembly. Later the entire
convention will be presented; however, for now, one only needs to know
that all C symbols (i.e., functions and global variables) have an underscore
prefix prepended to them by the C compiler. (This rule is specifically for
DOS/Windows; the Linux C compiler does not prepend anything to C symbol
names.)


The global directive on line 37 tells the assembler to make the _asm_main
label global. Unlike in C, labels have internal scope by default. This means
that only code in the same module can use the label. The global directive
gives the specified label (or labels) external scope. This type of label can be
accessed by any module in the program. The asm_io module declares the
print_int, et al. labels to be global. This is why one can use them in the
first.asm module.


1.4.2 Compiler dependencies


The assembly code above is specific to the free GNU¹⁰-based DJGPP
C/C++ compiler.¹¹ This compiler can be freely downloaded from the
Internet. It requires a 386-based PC or better and runs under DOS, Windows
95/98 or NT. This compiler uses object files in the COFF (Common Object 
File Format) format. To assemble to this format use the -f coff switch 
with nasm (as shown in the comments of the above code). The extension of 
the resulting object file will be o. 

The Linux C compiler is a GNU compiler also. To convert the code 
above to run under Linux, simply remove the underscore prefixes in lines 37 
and 38. Linux uses the ELF (Executable and Linkable Format) format for 
object files. Use the -f elf switch for Linux. It also produces an object 
with an o extension. 

Borland C/C++ is another popular compiler. It uses the Microsoft
OMF format for object files. Use the -f obj switch for Borland compilers.
The extension of the object file will be obj. The OMF format uses different
segment directives than the other object formats. The data segment
(line 13) must be changed to:

segment _DATA public align=4 class=DATA use32

The bss segment (line 26) must be changed to:

segment _BSS public align=4 class=BSS use32

The text segment (line 36) must be changed to:

segment _TEXT public align=1 class=CODE use32

In addition a new line should be added before line 36:

group DGROUP _BSS _DATA


The Microsoft C/C++ compiler can use either the OMF format or the 
Win32 format for object files. (If given an OMF format file, it converts the
information to Win32 format internally.) Win32 format allows segments 
to be defined just as for DJGPP and Linux. Use the -f win32 switch to 
output in this mode. The extension of the object file will be obj. 


1.4.3 Assembling the code


The first step is to assemble the code. From the command line, type:


nasm -f object-format first.asm


where object-format is either coff, elf, obj or win32 depending on what C
compiler will be used. (Remember that the source file must be changed for
both Linux and Borland as well.)

¹⁰GNU is a project of the Free Software Foundation (http://www.fsf.org)
¹¹http://www.delorie.com/djgpp


1.4.4 Compiling the C code 


Compile the driver.c file using a C compiler. For DJGPP, use: 
gcc -c driver.c 


The -c switch means to just compile, do not attempt to link yet. This same 
switch works on Linux, Borland and Microsoft compilers as well. 


1.4.5 Linking the object files 


Linking is the process of combining the machine code and data in object 
files and library files together to create an executable file. As will be shown 
below, this process is complicated. 

C code requires the standard C library and special startup code to run. 
It is much easier to let the C compiler call the linker with the correct pa- 
rameters, than to try to call the linker directly. For example, to link the 
code for the first program using DJGPP, use: 


gcc -o first driver.o first.o asm_io.o 


This creates an executable called first.exe (or just first under Linux). 
With Borland, one would use: 


bcc32 first.obj driver.obj asm_io.obj


Borland uses the name of the first file listed to determine the executable
name. So in the above case, the program would be named first.exe.
It is possible to combine the compiling and linking step. For example, 


gcc -o first driver.c first.o asm_io.o 


Now gcc will compile driver.c and then link. 


1.4.6 Understanding an assembly listing file 


The -l listing-file switch can be used to tell nasm to create a listing
file of a given name. This file shows how the code was assembled. Here is
how lines 17 and 18 (in the data segment) appear in the listing file. (The
line numbers are in the listing file; however notice that the line numbers in
the source file may not be the same as the line numbers in the listing file.)


48 00000000 456E7465722061206E-   prompt1 db "Enter a number: ", 0
49 00000009 756D6265723A2000
50 00000011 456E74657220616E6F-   prompt2 db "Enter another number: ", 0
51 0000001A 74686572206E756D62-
52 00000023 65723A2000


The first column in each line is the line number and the second is the offset 
(in hex) of the data in the segment. The third column shows the raw hex 
values that will be stored. In this case the hex data correspond to ASCII 
codes. Finally, the text from the source file is displayed on the line. The 
offsets listed in the second column are very likely not the true offsets that 
the data will be placed at in the complete program. Each module may define 
its own labels in the data segment (and the other segments, too). In the link 
step (see section 1.4.5), all these data segment label definitions are combined 
to form one data segment. The new final offsets are then computed by the 
linker. 

Here is a small section (lines 54 to 56 of the source file) of the text 
segment in the listing file: 


94 0000002C A1[00000000]          mov eax, [input1]
95 00000031 0305[04000000]        add eax, [input2]
96 00000037 89C3                  mov ebx, eax


The third column shows the machine code generated by the assembler. Often
the complete code for an instruction can not be computed yet. For example,
in line 94 the offset (or address) of input1 is not known until the code is
linked. The assembler can compute the op-code for the mov instruction
(which from the listing is A1), but it writes the offset in square brackets
because the exact value can not be computed yet. In this case, a temporary 
offset of 0 is used because input1 is at the beginning of the part of the bss 
segment defined in this file. Remember this does not mean that it will be 
at the beginning of the final bss segment of the program. When the code 
is linked, the linker will insert the correct offset into the position. Other 
instructions, like line 96, do not reference any labels. Here the assembler 
can compute the complete machine code. 


Big and Little Endian Representation 


If one looks closely at line 95, something seems very strange about the
offset in the square brackets of the machine code. The input2 label is at
offset 4 (as defined in this file); however, the offset that appears in memory
is not 00000004, but 04000000. Why? Different processors store multibyte
integers in different orders in memory. There are two popular methods of
storing integers: big endian and little endian. (Endian is pronounced like
indian.) Big endian is the method that seems the most natural. The biggest
(i.e. most significant) byte is stored first, then the next biggest, etc. For
example, the dword 00000004 would be stored as the four bytes 00 00 00 04.
IBM mainframes, most RISC processors and Motorola processors all use this
big endian method. However, Intel-based processors use the little endian
method! Here the least significant byte is stored first. So, 00000004 is
stored in memory as 04 00 00 00. This format is hardwired into the CPU and
can not be changed. Normally, the programmer does not need to worry about
which format is used. However, there are circumstances where it is important.


1. When binary data is transferred between different computers (either
from files or through a network).


2. When binary data is written out to memory as a multibyte integer 
and then read back as individual bytes or vice versa. 


Endianness does not apply to the order of array elements. The first 
element of an array is always at the lowest address. This applies to strings 
(which are just character arrays). Endianness still applies to the individual 
elements of the arrays. 
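
The byte ordering described above can be observed from C. The following is a minimal sketch, not code from the author's library; the function names bytes_in_memory, is_little_endian and swap_bytes32 are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Copy the four bytes of a double word in the order they sit in memory
   (lowest address first). */
static void bytes_in_memory(uint32_t value, uint8_t out[4]) {
    memcpy(out, &value, sizeof value);
}

/* Returns 1 if the least significant byte is stored first (little
   endian, as on Intel processors) and 0 otherwise. */
static int is_little_endian(void) {
    uint8_t b[4];
    bytes_in_memory(0x00000004u, b);
    return b[0] == 0x04;
}

/* Reverse the byte order of a double word. This is the conversion
   needed when binary data moves between a big endian machine and a
   little endian one. */
static uint32_t swap_bytes32(uint32_t v) {
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}
```

On an Intel-based PC, bytes_in_memory(0x00000004, b) fills b with 04 00 00 00, which matches the offset seen in the listing file. The swap operates on the value itself, so swap_bytes32 gives the same result on any machine.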


1.5 Skeleton File 


Figure 1.7 shows a skeleton file that can be used as a starting point for
writing assembly programs.


skel.asm

%include "asm_io.inc"
segment .data
;
; initialized data is put in the data segment here
;

segment .bss
;
; uninitialized data is put in the bss segment
;

segment .text
        global  _asm_main
_asm_main:
        enter   0,0               ; setup routine
        pusha

;
; code is put in the text segment. Do not modify the code before
; or after this comment.
;

        popa
        mov     eax, 0            ; return back to C
        leave
        ret

skel.asm


Figure 1.7: Skeleton Program 


Chapter 2 


Basic Assembly Language 


2.1 Working with Integers 


2.1.1 Integer representation 


Integers come in two flavors: unsigned and signed. Unsigned integers
(which are non-negative) are represented in a very straightforward binary
manner. The number 200 as a one byte unsigned integer would be represented
by 11001000 (or C8 in hex).

Signed integers (which may be positive or negative) are represented in
more complicated ways. For example, consider -56. +56 as a byte would be
represented by 00111000. On paper, one could represent -56 as -111000,
but how would this be represented in a byte in the computer’s memory? How
would the minus sign be stored?

There are three general techniques that have been used to represent 
signed integers in computer memory. All of these methods use the most 
significant bit of the integer as a sign bit. This bit is 0 if the number is 
positive and 1 if negative. 


Signed magnitude 


The first method is the simplest and is called signed magnitude. It
represents the integer as two parts. The first part is the sign bit and the
second is the magnitude of the integer. So 56 would be represented as the byte
00111000 (the sign bit is the leftmost bit) and -56 would be 10111000. The
largest byte value would be 01111111 or +127 and the smallest byte value
would be 11111111 or -127. To negate a value, the sign bit is reversed.
This method is straightforward, but it does have its drawbacks. First, there
are two possible values of zero, +0 (00000000) and -0 (10000000). Since
zero is neither positive nor negative, both of these representations should act
the same. This complicates the logic of arithmetic for the CPU. Secondly,
general arithmetic is also complicated. If 10 is added to -56, this must be
recast as 56 subtracted from 10. Again, this complicates the logic of the CPU.


One’s complement 


The second method is known as one’s complement representation. The
one’s complement of a number is found by reversing each bit in the number.
(Another way to look at it is that the new bit value is 1 - oldbitvalue.) For
example, the one’s complement of 00111000 (+56) is 11000111. In one’s
complement notation, computing the one’s complement is equivalent to
negation. Thus, 11000111 is the representation for -56. Note that the sign bit
was automatically changed by one’s complement and that, as one would
expect, taking the one’s complement twice yields the original number. As for
the first method, there are two representations of zero: 00000000 (+0) and
11111111 (-0). Arithmetic with one’s complement numbers is complicated.

There is a handy trick to finding the one’s complement of a number in 
hexadecimal without converting it to binary. The trick is to subtract the hex 
digit from F (or 15 in decimal). This method assumes that the number of 
bits in the number is a multiple of 4. Here is an example: +56 is represented 
by 38 in hex. To find the one’s complement, subtract each digit from F to 
get C7 in hex. This agrees with the result above. 
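
Both ways of finding the one’s complement can be checked against each other in C. The sketch below assumes one byte values; the function names are hypothetical and serve only to illustrate the trick.

```c
#include <stdint.h>

/* Reverse every bit of the byte using C's bitwise NOT operator. */
static uint8_t ones_complement8(uint8_t x) {
    return (uint8_t)~x;
}

/* The hex trick: subtract each 4-bit hex digit from F. */
static uint8_t ones_complement8_by_digits(uint8_t x) {
    uint8_t high = (uint8_t)(0xF - (x >> 4));   /* upper hex digit */
    uint8_t low  = (uint8_t)(0xF - (x & 0xF));  /* lower hex digit */
    return (uint8_t)((high << 4) | low);
}
```

For 38 (hex) both functions give C7, and they agree for every other byte value as well, since subtracting a 4-bit digit from F reverses each of its bits.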


Two’s complement 


The first two methods described were used on early computers. Modern 
computers use a third method called two’s complement representation. The 
two’s complement of a number is found by the following two steps: 


1. Find the one’s complement of the number 
2. Add one to the result of step 1 


Here’s an example using 00111000 (56). First the one’s complement is com- 
puted: 11000111. Then one is added: 


  11000111
+        1
  11001000


In two’s complement notation, computing the two’s complement is equivalent
to negating a number. Thus, 11001000 is the two’s complement representation
of -56. Two negations should reproduce the original number.
Surprisingly, two’s complement does meet this requirement. Take the two’s
complement of 11001000 by adding one to the one’s complement.


  00110111
+        1
  00111000


Number | Hex Representation
     0 | 00
     1 | 01
   127 | 7F
  -128 | 80
  -127 | 81
    -2 | FE
    -1 | FF


Table 2.1: Two’s Complement Representation


When performing the addition in the two’s complement operation, the
addition of the leftmost bit may produce a carry. This carry is not used.
Remember that all data on the computer is of some fixed size (in terms of
number of bits). Adding two bytes always produces a byte as a result (just
as adding two words produces a word, etc.). This property is important for
two’s complement notation. For example, consider zero as a one byte two’s
complement number (00000000). Computing its two’s complement produces
the sum:


  11111111
+        1
c 00000000


where c represents a carry. (Later it will be shown how to detect this carry, 
but it is not stored in the result.) Thus, in two’s complement notation there 
is only one zero. This makes two’s complement arithmetic simpler than the 
previous methods. 
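
The two steps and the discarded carry can be followed in C. This is a sketch for one byte values only, with hypothetical function names; it models the arithmetic described above rather than any particular CPU instruction.

```c
#include <stdint.h>

/* Two's complement of a byte: one's complement, then add one.
   The final cast keeps only 8 bits, so a carry out of the leftmost
   bit is dropped. */
static uint8_t twos_complement8(uint8_t x) {
    uint8_t ones = (uint8_t)~x;       /* step 1: one's complement */
    unsigned int sum = ones + 1u;     /* step 2: add one (may reach 0x100) */
    return (uint8_t)sum;              /* result is still one byte */
}

/* Reports whether step 2 produced a carry out of the byte. */
static int carry_out8(uint8_t x) {
    return ((uint8_t)~x + 1u) > 0xFFu;
}

/* Interpret a byte as a two's complement number (-128..+127). */
static int as_signed8(uint8_t byte) {
    return byte < 0x80 ? (int)byte : (int)byte - 256;
}
```

Negating 38 (hex) gives C8 and negating C8 gives 38 again. Negating zero gives zero with a carry, so there is only one representation of zero.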

Using two’s complement notation, a signed byte can be used to represent
the numbers -128 to +127. Table 2.1 shows some selected values. If 16
bits are used, the signed numbers -32,768 to +32,767 can be represented.
+32,767 is represented by 7FFF, -32,768 by 8000, -128 as FF80 and -1 as
FFFF. 32 bit two’s complement numbers range from approximately -2 billion
to +2 billion.

The CPU has no idea what a particular byte (or word or double word) is
supposed to represent. Assembly does not have the idea of types that a high
level language has. How data is interpreted depends on what instruction is
used on the data. Whether the hex value FF is considered to represent a
signed -1 or an unsigned +255 depends on the programmer. The C language
defines signed and unsigned integer types. This allows a C compiler to
determine the correct instructions to use with the data.


2.1.2 Sign extension 


In assembly, all data has a specified size. It is not uncommon to need 
to change the size of data to use it with other data. Decreasing size is the 
easiest. 


Decreasing size of data 


To decrease the size of data, simply remove the more significant bits of 
the data. Here’s a trivial example: 


mov ax, 0034h ; ax = 52 (stored in 16 bits) 
mov cl, al ; cl = lower 8-bits of ax 


Of course, if the number can not be represented correctly in the smaller
size, decreasing the size does not work. For example, if AX were 0134h (or
308 in decimal) then the above code would still set CL to 34h. This method
works with both signed and unsigned numbers. Consider signed numbers:
if AX was FFFFh (-1 as a word), then CL would be FFh (-1 as a byte).
However, note that this is not correct if the value in AX was unsigned!

The rule for unsigned numbers is that all the bits being removed must 
be 0 for the conversion to be correct. The rule for signed numbers is that 
the bits being removed must be either all 1’s or all 0’s. In addition, the first 
bit not being removed must have the same value as the removed bits. This 
bit will be the new sign bit of the smaller value. It is important that it be
the same as the original sign bit!
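
The two rules can be written as simple bit tests. The following C sketch covers only the word to byte case; the function names are hypothetical and used purely for illustration.

```c
#include <stdint.h>

/* Keep only the lower 8 bits, as "mov cl, al" does for AX. */
static uint8_t truncate_to_byte(uint16_t w) {
    return (uint8_t)(w & 0xFF);
}

/* Unsigned rule: every removed bit (bits 15..8) must be 0. */
static int fits_unsigned_byte(uint16_t w) {
    return (w & 0xFF00u) == 0;
}

/* Signed rule: the removed bits (15..8) must be all 0's or all 1's and
   the new sign bit (bit 7) must match them, so bits 15..7 are tested
   together. */
static int fits_signed_byte(uint16_t w) {
    uint16_t top = (uint16_t)(w & 0xFF80u);
    return top == 0x0000u || top == 0xFF80u;
}
```

For example, FFFF passes the signed test (it is -1 in both sizes) but fails the unsigned test, while 0080 passes the unsigned test (128) but fails the signed one because the truncated byte 80 would mean -128.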


Increasing size of data 


Increasing the size of data is more complicated than decreasing. Consider
the hex byte FF. If it is extended to a word, what value should the word
have? It depends on how FF is interpreted. If FF is an unsigned byte (255
in decimal), then the word should be 00FF; however, if it is a signed byte
(-1 in decimal), then the word should be FFFF.

In general, to extend an unsigned number, one makes all the new bits
of the expanded number 0. Thus, FF becomes 00FF. However, to extend
a signed number, one must extend the sign bit. This means that the new
bits become copies of the sign bit. Since the sign bit of FF is 1, the new
bits must also be all ones, to produce FFFF. If the signed number 5A (90
in decimal) was extended, the result would be 005A.
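
The two extension rules are short enough to spell out in C. This sketch handles only the byte to word case and uses hypothetical function names.

```c
#include <stdint.h>

/* Unsigned rule: all of the new upper bits are 0. */
static uint16_t zero_extend8(uint8_t b) {
    return b;
}

/* Signed rule: the new upper bits are copies of the sign bit (bit 7). */
static uint16_t sign_extend8(uint8_t b) {
    return (b & 0x80) ? (uint16_t)(0xFF00u | b) : (uint16_t)b;
}
```

With these definitions FF extends to 00FF or FFFF depending on which rule is chosen, while 5A extends to 005A under both, because its sign bit is 0.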
There are several instructions that the 80386 provides for extension of
numbers. Remember that the computer does not know whether a number is 
signed or unsigned. It is up to the programmer to use the correct instruction. 

For unsigned numbers, one can simply put zeros in the upper bits using 
a MOV instruction. For example, to extend the byte in AL to an unsigned 
word in AX: 


mov ah, 0 ; zero out upper 8-bits 


However, it is not possible to use a MOV instruction to convert the unsigned 
word in AX to an unsigned double word in EAX. Why not? There is no way 
to specify the upper 16 bits of EAX in a MOV. The 80386 solves this problem 
by providing a new instruction MOVZX. This instruction has two operands. 
The destination (first operand) must be a 16 or 32 bit register. The source 
(second operand) may be an 8 or 16 bit register or a byte or word of memory. 
The other restriction is that the destination must be larger than the source. 
(Most instructions require the source and destination to be the same size.) 
Here are some examples: 


movzx eax, ax ; extends ax into eax
movzx eax, al ; extends al into eax 
movzx ax, al ; extends al into ax 
movzx ebx, ax ; extends ax into ebx 


For signed numbers, there is no easy way to use the MOV instruction for 
any case. The 8086 provided several instructions to extend signed numbers. 
The CBW (Convert Byte to Word) instruction sign extends the AL register 
into AX. The operands are implicit. The CWD (Convert Word to Double 
word) instruction sign extends AX into DX:AX. The notation DX:AX means 
to think of the DX and AX registers as one 32 bit register with the upper 
16 bits in DX and the lower bits in AX. (Remember that the 8086 did not 
have any 32 bit registers!) The 80386 added several new instructions. The 
CWDE (Convert Word to Double word Extended) instruction sign extends 
AX into EAX. The CDQ (Convert Double word to Quad word) instruction 
sign extends EAX into EDX:EAX (64 bits!). Finally, the MOVSX instruction 
works like MOVZX except it uses the rules for signed numbers. 
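
The register pair notation may be easier to picture with a small model in C. The structures and functions below are hypothetical stand-ins for the registers, written only to mimic what CWD and CDQ leave in DX:AX and EDX:EAX; they are not how one would call these instructions.

```c
#include <stdint.h>

/* Models of the register pairs DX:AX and EDX:EAX. */
typedef struct { uint16_t dx; uint16_t ax; } dx_ax_pair;
typedef struct { uint32_t edx; uint32_t eax; } edx_eax_pair;

/* CWD: AX is unchanged and DX is filled with copies of AX's sign bit. */
static dx_ax_pair cwd(uint16_t ax) {
    dx_ax_pair r;
    r.ax = ax;
    r.dx = (ax & 0x8000u) ? 0xFFFFu : 0x0000u;
    return r;
}

/* CDQ: EAX is unchanged and EDX is filled with copies of EAX's sign bit. */
static edx_eax_pair cdq(uint32_t eax) {
    edx_eax_pair r;
    r.eax = eax;
    r.edx = (eax & 0x80000000u) ? 0xFFFFFFFFu : 0x00000000u;
    return r;
}

/* Read DX:AX as one 32 bit value (DX holds the upper 16 bits). */
static uint32_t dx_ax_value(dx_ax_pair r) {
    return ((uint32_t)r.dx << 16) | r.ax;
}

/* Read EDX:EAX as one 64 bit value (EDX holds the upper 32 bits). */
static uint64_t edx_eax_value(edx_eax_pair r) {
    return ((uint64_t)r.edx << 32) | r.eax;
}
```

Extending the word FFFF (-1) this way yields FFFFFFFF in DX:AX, which is still -1 as a double word.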


Application to C programming 


Extending of unsigned and signed integers also occurs in C. Variables in 
C may be declared as either signed or unsigned (int is signed). Consider 
the code in Figure 2.1. In line 3, the variable a is extended using the rules 
for unsigned values (using MOVZX), but in line 4, the signed rules are used 
for b (using MOVSX). 


(ANSI C does not define whether the char type is signed or not; it is up to
each individual compiler to decide this. That is why the type is explicitly
defined in Figure 2.1.)


unsigned char uchar = 0xFF;
signed char schar = 0xFF;
int a = (int) uchar; /* a = 255 (0x000000FF) */
int b = (int) schar; /* b = -1 (0xFFFFFFFF) */


Figure 2.1: 


char ch; 

while( (ch = fgetc(fp)) != EOF ) { 
/* do something with ch */ 

} 


Figure 2.2: 


There is a common C programming bug that directly relates to this
subject. Consider the code in Figure 2.2. The prototype of fgetc() is:


int fgetc( FILE * ); 


One might question why does the function return back an int since it reads 
characters? The reason is that it normally does return back an char (ex- 
tended to an int value using zero extension). However, there is one value 
that it may return that is not a character, EOF. This is a macro that is 
usually defined as —1. Thus, fgetc( either returns back a char extended 
to an int value (which looks like 000000zz in hex) or EOF (which looks like 
FFFFFFFF in hex). 

The basic problem with the program in Figure 2.2 is that fgetc() returns
an int, but this value is stored in a char. C will truncate the higher
order bits to fit the int value into the char. The only problem is that the
numbers (in hex) 000000FF and FFFFFFFF both will be truncated to the
byte FF. Thus, the while loop cannot distinguish between reading the byte
FF from the file and end of file.

Exactly what the code does in this case depends on whether char is
signed or unsigned. Why? Because in line 2, ch is compared with EOF.
Since EOF is an int value[1], ch will be extended to an int so that the two
values being compared are of the same size[2]. As Figure 2.1 showed, whether
the variable is signed or unsigned is very important.

If char is unsigned, FF is extended to be 000000FF. This is compared to
EOF (FFFFFFFF) and found to be not equal. Thus, the loop never ends!

If char is signed, FF is extended to FFFFFFFF. This does compare as
equal and the loop ends. However, since the byte FF may have been read
from the file, the loop could be ending prematurely.


[1] It is a common misconception that files have an EOF character at their
end. This is not true!
[2] The reason for this requirement will be shown later.

The solution to this problem is to define the ch variable as an int, not a 
char. When this is done, no truncating or extension is done in line 2. Inside 
the loop, it is safe to truncate the value since ch must actually be a simple 
byte there. 
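
The corrected loop can be sketched as follows. This is only an
illustration: the function name count_bytes and the idea of counting the
bytes are assumptions made for the example, not part of the original
program.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts the bytes in a stream.
   ch is an int, so the byte FF (returned as 0x000000FF) stays
   distinct from EOF (usually -1, i.e. 0xFFFFFFFF). */
long count_bytes(FILE *fp)
{
    int ch;            /* int, not char: holds every byte value plus EOF */
    long count = 0;

    assert(fp != NULL);
    while ((ch = fgetc(fp)) != EOF) {
        /* Inside the loop ch is a real byte, so truncating is safe. */
        unsigned char byte = (unsigned char) ch;
        (void) byte;   /* do something with byte */
        count++;
    }
    return count;
}
```

With ch declared as an int, a file containing the byte FF is read to its
real end, whether the compiler treats char as signed or unsigned.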


2.1.3 Two’s complement arithmetic 


As was seen earlier, the add instruction performs addition and the sub
instruction performs subtraction. Two of the bits in the FLAGS register that
these instructions set are the overflow and carry flags. The overflow flag is
set if the true result of the operation is too big to fit into the destination
for signed arithmetic. The carry flag is set if there is a carry in the msb
of an addition or a borrow in the msb of a subtraction. Thus, it can be
used to detect overflow for unsigned arithmetic. The uses of the carry flag
for signed arithmetic will be seen shortly. One of the great advantages of
2’s complement is that the rules for addition and subtraction are exactly the
same as for unsigned integers. Thus, add and sub may be used on signed or
unsigned integers.


  002C        44
+ FFFF    + (-1)
  ----    ------
  002B        43


There is a carry generated, but it is not part of the answer. 

There are two different multiply and divide instructions. First, to
multiply use either the MUL or IMUL instruction. The MUL instruction is used
to multiply unsigned numbers and IMUL is used to multiply signed integers.
Why are two different instructions needed? The rules for multiplication are
different for unsigned and 2’s complement signed numbers. How so? Consider
the multiplication of the byte FF with itself yielding a word result.
Using unsigned multiplication this is 255 times 255 or 65025 (or FE01 in
hex). Using signed multiplication this is -1 times -1 or 1 (or 0001 in hex).
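
The same difference can be seen in C. The sketch below models the byte
sized forms of the two instructions; the function names mul8 and imul8 are
made up for this illustration.

```c
#include <assert.h>   /* used by the checks in the note below */
#include <stdint.h>

/* Unsigned byte multiply, like MUL with an 8-bit source: AX = AL * source */
uint16_t mul8(uint8_t al, uint8_t source)
{
    return (uint16_t) ((uint16_t) al * source);
}

/* Signed byte multiply, like one-operand IMUL with an 8-bit source */
int16_t imul8(int8_t al, int8_t source)
{
    return (int16_t) ((int16_t) al * source);
}
```

The same bit pattern FF gives different word results: the check
assert(mul8(0xFF, 0xFF) == 0xFE01) holds for the unsigned version, and
assert(imul8(-1, -1) == 1) holds for the signed one.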

There are several forms of the multiplication instructions. The oldest 
form looks like: 


mul source 


The source is either a register or a memory reference. It cannot be an
immediate value. Exactly what multiplication is performed depends on the
size of the source operand. If the operand is byte sized, it is multiplied by
the byte in the AL register and the result is stored in the 16 bits of AX. If
the source is 16-bit, it is multiplied by the word in AX and the 32-bit result
is stored in DX:AX. If the source is 32-bit, it is multiplied by EAX and the
64-bit result is stored into EDX:EAX.

The IMUL instruction has the same formats as MUL, but also adds some
other instruction formats. There are two and three operand formats:


imul dest, source1
imul dest, source1, source2


Table 2.2 shows the possible combinations.


dest  | source1   | source2 | Action
      | reg/mem8  |         | AX = AL*source1
      | reg/mem16 |         | DX:AX = AX*source1
      | reg/mem32 |         | EDX:EAX = EAX*source1
reg16 | reg/mem16 |         | dest *= source1
reg32 | reg/mem32 |         | dest *= source1
reg16 | immed8    |         | dest *= immed8
reg32 | immed8    |         | dest *= immed8
reg16 | immed16   |         | dest *= immed16
reg32 | immed32   |         | dest *= immed32
reg16 | reg/mem16 | immed8  | dest = source1*source2
reg32 | reg/mem32 | immed8  | dest = source1*source2
reg16 | reg/mem16 | immed16 | dest = source1*source2
reg32 | reg/mem32 | immed32 | dest = source1*source2


Table 2.2: imul Instructions


The two division operators are DIV and IDIV. They perform unsigned
and signed integer division respectively. The general format is:


div source 


If the source is 8-bit, then AX is divided by the operand. The quotient is 
stored in AL and the remainder in AH. If the source is 16-bit, then DX:AX 
is divided by the operand. The quotient is stored into AX and remainder 
into DX. If the source is 32-bit, then EDX:EAX is divided by the operand 
and the quotient is stored into EAX and the remainder into EDX. The IDIV 
instruction works the same way. There are no special IDIV instructions like 
the special IMUL ones. If the quotient is too big to fit into its register or the 
divisor is zero, the program is interrupted and terminates. A very common 
error is to forget to initialize DX or EDX before division. 
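
To see why forgetting to initialize EDX matters, the 32-bit form of IDIV
can be modeled in C. This is a sketch under stated assumptions: the name
idiv32 is hypothetical, and the divide error that the processor would
raise is modeled with assert.

```c
#include <assert.h>
#include <stdint.h>

/* Models "idiv ecx": divides the signed 64-bit value EDX:EAX by ecx.
   Returns the quotient (EAX) and stores the remainder (EDX). */
int32_t idiv32(uint32_t edx, uint32_t eax, int32_t ecx, int32_t *remainder)
{
    /* EDX:EAX viewed as one signed 64-bit dividend */
    int64_t dividend = (int64_t) (((uint64_t) edx << 32) | eax);

    assert(ecx != 0);                               /* divide by zero */
    assert(!(dividend == INT64_MIN && ecx == -1));  /* would overflow */
    int64_t quotient = dividend / ecx;
    assert(quotient >= INT32_MIN && quotient <= INT32_MAX); /* must fit EAX */

    *remainder = (int32_t) (dividend % ecx);
    return (int32_t) quotient;
}
```

Dividing -100 by 7 with EDX sign extended (FFFFFFFF, as CDQ would leave
it) gives the quotient -14 and the remainder -2. With EDX mistakenly left
at 0, the same EAX is treated as the large positive number 4294967196 and
the quotient comes out as 613566742.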

The NEG instruction negates its single operand by computing its two’s
complement. Its operand may be any 8-bit, 16-bit, or 32-bit register or
memory location.


2.1.4 Example program


math.asm

%include "asm_io.inc"
segment .data                   ; Output strings
prompt          db    "Enter a number: ", 0
square_msg      db    "Square of input is ", 0
cube_msg        db    "Cube of input is ", 0
cube25_msg      db    "Cube of input times 25 is ", 0
quot_msg        db    "Quotient of cube/100 is ", 0
rem_msg         db    "Remainder of cube/100 is ", 0
neg_msg         db    "The negation of the remainder is ", 0

segment .bss
input           resd  1

segment .text
        global  _asm_main
_asm_main:
        enter   0,0               ; setup routine
        pusha

        mov     eax, prompt
        call    print_string

        call    read_int
        mov     [input], eax

        imul    eax               ; edx:eax = eax * eax
        mov     ebx, eax          ; save answer in ebx
        mov     eax, square_msg
        call    print_string
        mov     eax, ebx
        call    print_int
        call    print_nl

        mov     ebx, eax
        imul    ebx, [input]      ; ebx *= [input]
        mov     eax, cube_msg
        call    print_string
        mov     eax, ebx
        call    print_int
        call    print_nl

        imul    ecx, ebx, 25      ; ecx = ebx*25
        mov     eax, cube25_msg
        call    print_string
        mov     eax, ecx
        call    print_int
        call    print_nl

        mov     eax, ebx
        cdq                       ; initialize edx by sign extension
        mov     ecx, 100          ; can’t divide by immediate value
        idiv    ecx               ; edx:eax / ecx
        mov     ecx, eax          ; save quotient into ecx
        mov     eax, quot_msg
        call    print_string
        mov     eax, ecx
        call    print_int
        call    print_nl
        mov     eax, rem_msg
        call    print_string
        mov     eax, edx
        call    print_int
        call    print_nl

        neg     edx               ; negate the remainder
        mov     eax, neg_msg
        call    print_string
        mov     eax, edx
        call    print_int
        call    print_nl

        popa
        mov     eax, 0            ; return back to C
        leave
        ret

math.asm


2.1.5 Extended precision arithmetic 


Assembly language also provides instructions that allow one to perform
addition and subtraction of numbers larger than double words. These
instructions use the carry flag. As stated above, both the ADD and SUB
instructions modify the carry flag if a carry or borrow is generated,
respectively. This information stored in the carry flag can be used to add
or subtract large numbers by breaking up the operation into smaller double
word (or smaller) pieces.

The ADC and SBB instructions use this information in the carry flag. The
ADC instruction performs the following operation:

operand1 = operand1 + carry flag + operand2


The SBB instruction performs:

operand1 = operand1 - carry flag - operand2


How are these used? Consider the sum of 64-bit integers in EDX:EAX and 
EBX:ECX. The following code would store the sum in EDX:EAX: 


add eax, ecx ; add lower 32-bits 


adc edx, ebx ; add upper 32-bits and carry from previous sum 


Subtraction is very similar. The following code subtracts EBX:ECX from 
EDX:EAX: 


sub eax, ecx ; subtract lower 32-bits 
sbb edx, ebx ; subtract upper 32-bits and borrow 


For really large numbers, a loop could be used (see Section 2.2). For a
sum loop, it would be convenient to use the ADC instruction for every
iteration (instead of all but the first iteration). This can be done by using
the CLC (CLear Carry) instruction right before the loop starts to initialize
the carry flag to 0. If the carry flag is 0, there is no difference between
the ADD and ADC instructions. The same idea can be used for subtraction, too.
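
The loop idea can be sketched in C, where the carry has to be tracked by
hand because the language gives no access to the carry flag. The name
add_words and the choice to store the least significant word first are
assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds the n-word number b into a (least significant word first)
   and returns the carry out of the most significant word. */
unsigned add_words(uint32_t *a, const uint32_t *b, size_t n)
{
    unsigned carry = 0;                   /* like CLC before the loop */

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t) a[i] + b[i] + carry;  /* like ADC */
        a[i]  = (uint32_t) sum;           /* keep the low 32 bits */
        carry = (unsigned) (sum >> 32);   /* carry feeds the next word */
    }
    return carry;
}
```

Because the carry starts at 0, the first iteration behaves like ADD and
every later iteration behaves like ADC, so one loop body serves for all
of the words.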


2.2 Control Structures 


High level languages provide high level control structures (e.g., the if
and while statements) that control the thread of execution. Assembly
language does not provide such complex control structures. It instead uses
the infamous goto, which can result in spaghetti code when used
inappropriately! However, it is possible to write structured assembly
language programs. The basic procedure is to design the program logic using
the familiar high level control structures and translate the design into the
appropriate assembly language (much like a compiler would do).


2.2.1 Comparisons 


Control structures decide what to do based on comparisons of data. In
assembly, the result of a comparison is stored in the FLAGS register to be
used later. The 80x86 provides the CMP instruction to perform comparisons.
The FLAGS register is set based on the difference of the two operands of
the CMP instruction. The operands are subtracted and the FLAGS are set
based on the result, but the result is not stored anywhere. If you need the
result, use the SUB instruction instead of CMP.

For unsigned integers, there are two flags (bits in the FLAGS register) 
that are important: the zero (ZF) and carry (CF) flags. The zero flag is 
set (1) if the resulting difference would be zero. The carry flag is used as a 
borrow flag for subtraction. Consider a comparison like: 


cmp vleft, vright 


The difference of vleft - vright is computed and the flags are set
accordingly. If the difference computed by CMP is zero, vleft = vright, then
ZF is set (i.e. 1) and CF is unset (i.e. 0). If vleft > vright, then ZF is
unset and CF is unset (no borrow). If vleft < vright, then ZF is unset and
CF is set (borrow).

For signed integers, there are three flags that are important: the zero
(ZF) flag, the overflow (OF) flag and the sign (SF) flag. The overflow flag
is set if the result of an operation overflows (or underflows). The sign flag
is set if the result of an operation is negative. If vleft = vright, the ZF
is set (just as for unsigned integers). If vleft > vright, ZF is unset and
SF = OF. If vleft < vright, ZF is unset and SF ≠ OF.


Why does SF = OF if
vleft > vright? If there
is no overflow, then the
difference will have the
correct value and must
be non-negative. Thus,
SF = OF = 0. However,
if there is an overflow, the
difference will not have the
correct value (and in fact
will be negative). Thus,
SF = OF = 1.

Do not forget that other instructions can also change the FLAGS register, 
not just CMP. 
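
The flag rules above can be checked with a small C model of CMP for 32-bit
operands. This is a sketch only: the struct flags type and the function
cmp32 are invented for the illustration, and only the four flags discussed
here are modeled.

```c
#include <assert.h>   /* used by the checks in the note below */
#include <stdint.h>

struct flags { int zf, cf, sf, of; };

/* Models the flags set by "cmp vleft, vright" for 32-bit operands */
struct flags cmp32(uint32_t vleft, uint32_t vright)
{
    uint32_t diff = vleft - vright;   /* computed, then discarded by CMP */
    struct flags f;

    f.zf = (diff == 0);
    f.cf = (vleft < vright);          /* borrow out of the subtraction */
    f.sf = (int) (diff >> 31);        /* sign bit of the difference */
    /* Signed overflow: the operands differ in sign and the sign of the
       difference differs from the sign of vleft. */
    f.of = (int) (((vleft ^ vright) & (vleft ^ diff)) >> 31);
    return f;
}
```

For example, after struct flags f = cmp32(0x7FFFFFFF, 0xFFFFFFFF) the
check assert(f.sf == 1 && f.of == 1) holds: as signed numbers the left
operand is greater, the subtraction overflows, and SF = OF. The same
comparison sets CF, because as unsigned numbers the left operand is
smaller.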


2.2.2 Branch instructions 


Branch instructions can transfer execution to arbitrary points of a
program. In other words, they act like a goto. There are two types of
branches: unconditional and conditional. An unconditional branch is just like
a goto; it always makes the branch. A conditional branch may or may not make
the branch depending on the flags in the FLAGS register. If a conditional
branch does not make the branch, control passes to the next instruction.

The JMP (short for jump) instruction makes unconditional branches. Its
single argument is usually a code label to the instruction to branch to. The
assembler or linker will replace the label with the correct address of the
instruction. This is another one of the tedious operations that the assembler
does to make the programmer’s life easier. It is important to realize that
the statement immediately after the JMP instruction will never be executed
unless another instruction branches to it!

There are several variations of the jump instruction: 


SHORT This jump is very limited in range. It can only move up or down
      128 bytes in memory. The advantage of this type is that it uses less
      memory than the others. It uses a single signed byte to store the
      displacement of the jump. The displacement is how many bytes to
      move ahead or behind. (The displacement is added to EIP). To specify
      a short jump, use the SHORT keyword immediately before the label in
      the JMP instruction.


NEAR This jump is the default type for both unconditional and
      conditional branches; it can be used to jump to any location in a
      segment. Actually, the 80386 supports two types of near jumps. One
      uses two bytes for the displacement. This allows one to move up or
      down roughly 32,000 bytes. The other type uses four bytes for the
      displacement, which of course allows one to move to any location in
      the code segment. The four byte type is the default in 386 protected
      mode. The two byte type can be specified by putting the WORD keyword
      before the label in the JMP instruction.


FAR This jump allows control to move to another code segment. This is a
      very rare thing to do in 386 protected mode.


Valid code labels follow the same rules as data labels. Code labels are
defined by placing them in the code segment in front of the statement they
label. A colon is placed at the end of the label at its point of definition. The
colon is not part of the name.

There are many different conditional branch instructions. They also
take a code label as their single operand. The simplest ones just look at a
single flag in the FLAGS register to determine whether to branch or not.
See Table 2.3 for a list of these instructions. (PF is the parity flag, which
indicates whether the number of bits set in the lower 8 bits of the result
is odd or even.)


JZ   branches only if ZF is set
JNZ  branches only if ZF is unset
JO   branches only if OF is set
JNO  branches only if OF is unset
JS   branches only if SF is set
JNS  branches only if SF is unset
JC   branches only if CF is set
JNC  branches only if CF is unset
JP   branches only if PF is set
JNP  branches only if PF is unset


Table 2.3: Simple Conditional Branches

The following pseudo-code:


if ( EAX == 0 )
  EBX = 1;
else
  EBX = 2;


could be written in assembly as: 


cmp eax, 0 ; set flags (ZF set if eax - 0 = 0) 

jz thenblock ; if ZF is set branch to thenblock 

mov ebx, 2 ; ELSE part of IF 

jmp next ; jump over THEN part of IF 
thenblock: 

mov ebx, 1 ; THEN part of IF 


next: 


Other comparisons are not so easy using the conditional branches in 
Table 2.3. To illustrate, consider the following pseudo-code: 


if ( EAX >= 5 )
  EBX = 1;
else
  EBX = 2;


If EAX is greater than or equal to five, the ZF may be set or unset and 
SF will equal OF. Here is assembly code that tests for these conditions 
(assuming that EAX is signed): 


cmp eax, 5 

js signon ; goto signon if SF = 1 

jo elseblock ; goto elseblock if OF = 1 and SF = 0 

jmp thenblock ; goto thenblock if SF = 0 and OF = 0 
signon: 

jo thenblock ; goto thenblock if SF = 1 and OF = 1
elseblock: 

mov ebx, 2 

jmp next 
thenblock: 

mov ebx, 1 


next: 


The above code is very awkward. Fortunately, the 80x86 provides
additional branch instructions to make these types of tests much easier.
There are signed and unsigned versions of each. Table 2.4 shows these
instructions. The equal and not equal branches (JE and JNE) are the same
for both signed and unsigned integers. (In fact, JE and JNE are really
identical to JZ and JNZ, respectively.) Each of the other branch
instructions has two synonymous mnemonics. For example, look at JL (jump
less than) and JNGE (jump not greater than or equal to). These are the same
instruction because:


x < y ⟹ not(x ≥ y)


Signed                                 | Unsigned
JE        branches if vleft = vright   | JE        branches if vleft = vright
JNE       branches if vleft ≠ vright   | JNE       branches if vleft ≠ vright
JL, JNGE  branches if vleft < vright   | JB, JNAE  branches if vleft < vright
JLE, JNG  branches if vleft ≤ vright   | JBE, JNA  branches if vleft ≤ vright
JG, JNLE  branches if vleft > vright   | JA, JNBE  branches if vleft > vright
JGE, JNL  branches if vleft ≥ vright   | JAE, JNB  branches if vleft ≥ vright


Table 2.4: Signed and Unsigned Comparison Instructions


The unsigned branches use A for above and B for below instead of L and G.
Using these new branch instructions, the pseudo-code above can be
translated to assembly much more easily.


cmp eax, 5 

jge thenblock 

mov ebx, 2 

jmp next 
thenblock: 

mov ebx, 1 


next: 


2.2.3 The loop instructions


The 80x86 provides several instructions designed to implement for-like
loops. Each of these instructions takes a code label as its single operand.


LOOP Decrements ECX, if ECX ≠ 0, branches to label


LOOPE, LOOPZ Decrements ECX (FLAGS register is not modified), if
ECX ≠ 0 and ZF = 1, branches


LOOPNE, LOOPNZ Decrements ECX (FLAGS unchanged), if ECX ≠
0 and ZF = 0, branches


The last two loop instructions are useful for sequential search loops. The
following pseudo-code:


sum = 0;
for( i=10; i >0; i-- )
  sum += i;


could be translated into assembly as: 


mov eax, 0 ; eax is sum 

mov ecx, 10 ; ecx is i 
loop_start: 

add eax, ecx 


loop loop_start 


2.3 Translating Standard Control Structures 


This section looks at how the standard control structures of high level 
languages can be implemented in assembly language. 


2.3.1 If statements 
The following pseudo-code: 


if ( condition ) 
then_block; 
else 
else_block;


could be implemented as: 


; code to set FLAGS 
jxx else_block ; select xx so that branches if condition false 
; code for then block 
jmp endif 
else_block: 
; code for else block 
endif: 


If there is no else, then the else_block branch can be replaced by a 
branch to endif. 


; code to set FLAGS 
jxx endif ; select xx so that branches if condition false 
; code for then block 

endif:


2.3.2 While loops

The while loop is a top tested loop:


while( condition ) {
  body of loop;
}


This could be translated into: 


while: 
; code to set FLAGS based on condition 
jxx endwhile ; select xx so that branches if false 
; body of loop 
jmp while 
endwhile: 
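
The shape of this translation can be tried out in C itself by writing a
while loop with nothing but if and goto. The sketch below sums the integers
from 1 to n; the function name sum_to and the labels are made up for the
illustration.

```c
#include <assert.h>

/* Sums 1..n with a while loop laid out like its assembly translation:
   a test at the top that jumps out when the condition is false, and an
   unconditional jump back to the test at the bottom. */
unsigned sum_to(unsigned n)
{
    unsigned sum = 0, i = 1;

    assert(n <= 65535);              /* keeps the sum within 32 bits */
while_top:
    if (!(i <= n)) goto endwhile;    /* jxx endwhile: branch if false */
    sum += i;                        /* body of loop */
    i++;
    goto while_top;                  /* jmp while */
endwhile:
    return sum;
}
```

Note that the test is written as the negation of the loop condition, which
is exactly why the conditional jump in the assembly version is chosen to
branch when the condition is false.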


2.3.3 Do while loops 
The do while loop is a bottom tested loop: 


do { 
body of loop; 
} while( condition ); 


This could be translated into: 


do: 
; body of loop 
; code to set FLAGS based on condition 
jxx do ; select xx so that branches if true 


2.4 Example: Finding Prime Numbers 


This section looks at a program that finds prime numbers. Recall that
prime numbers are evenly divisible by only 1 and themselves. There is no
formula for finding them. The basic method this program uses is to find the
factors of all odd numbers[3] below a given limit. If no factor can be found
for an odd number, it is prime. Figure 2.3 shows the basic algorithm written
in C.

[3] 2 is the only even prime number.


unsigned guess;    /* current guess for prime */
unsigned factor;   /* possible factor of guess */
unsigned limit;    /* find primes up to this value */

printf("Find primes up to: ");
scanf("%u", &limit);
printf("2\n");     /* treat first two primes as */
printf("3\n");     /* special case */
guess = 5;         /* initial guess */
while ( guess <= limit ) {
  /* look for a factor of guess */
  factor = 3;
  while ( factor*factor < guess &&
          guess % factor != 0 )
    factor += 2;
  if ( guess % factor != 0 )
    printf("%u\n", guess);
  guess += 2;      /* only look at odd numbers */
}


Figure 2.3:


Here’s the assembly version:


prime.asm

%include "asm_io.inc"
segment .data
Message         db      "Find primes up to: ", 0

segment .bss
Limit           resd    1            ; find primes up to this limit
Guess           resd    1            ; the current guess for prime

segment .text
        global  _asm_main
_asm_main:
        enter   0,0                  ; setup routine
        pusha

        mov     eax, Message
        call    print_string
        call    read_int             ; scanf("%u", & limit );
        mov     [Limit], eax

        mov     eax, 2               ; printf("2\n");
        call    print_int
        call    print_nl
        mov     eax, 3               ; printf("3\n");
        call    print_int
        call    print_nl

        mov     dword [Guess], 5     ; Guess = 5;
while_limit:                         ; while ( Guess <= Limit )
        mov     eax, [Guess]
        cmp     eax, [Limit]
        jnbe    end_while_limit      ; use jnbe since numbers are unsigned

        mov     ebx, 3               ; ebx is factor = 3;
while_factor:
        mov     eax, ebx
        mul     eax                  ; edx:eax = eax*eax
        jo      end_while_factor     ; if answer won’t fit in eax alone
        cmp     eax, [Guess]
        jnb     end_while_factor     ; if !(factor*factor < guess)
        mov     eax, [Guess]
        mov     edx, 0
        div     ebx                  ; edx = edx:eax % ebx
        cmp     edx, 0
        je      end_while_factor     ; if !(guess % factor != 0)

        add     ebx, 2               ; factor += 2;
        jmp     while_factor
end_while_factor:
        je      end_if               ; if !(guess % factor != 0)
        mov     eax, [Guess]         ; printf("%u\n")
        call    print_int
        call    print_nl
end_if:
        add     dword [Guess], 2     ; guess += 2
        jmp     while_limit
end_while_limit:

        popa
        mov     eax, 0               ; return back to C
        leave
        ret

prime.asm


Chapter 3


Bit Operations 


3.1 Shift Operations 


Assembly language allows the programmer to manipulate the individual 
bits of data. One common bit operation is called a shift. A shift operation 
moves the position of the bits of some data. Shifts can be either toward 
the left (i.e. toward the most significant bits) or toward the right (the least 
significant bits). 


3.1.1 Logical shifts 


A logical shift is the simplest type of shift. It shifts in a very straightfor- 
ward manner. Figure 3.1 shows an example of a shifted single byte number. 


Original       1 1 1 0 1 0 1 0
Left shifted   1 1 0 1 0 1 0 0
Right shifted  0 1 1 1 0 1 0 1


Figure 3.1: Logical shifts 


Note that new, incoming bits are always zero. The SHL and SHR instruc- 
tions are used to perform logical left and right shifts respectively. These 
instructions allow one to shift by any number of positions. The number of 
positions to shift can either be a constant or can be stored in the CL register. 
The last bit shifted out of the data is stored in the carry flag. Here are some 
code examples: 


mov ax, 0C123H
shl ax, 1 ; shift 1 bit to left, ax = 8246H, CF = 1
shr ax, 1 ; shift 1 bit to right, ax = 4123H, CF = 0
shr ax, 1 ; shift 1 bit to right, ax = 2091H, CF = 1
mov ax, 0C123H
shl ax, 2 ; shift 2 bits to left, ax = 048CH, CF = 1
mov cl, 3
shr ax, cl ; shift 3 bits to right, ax = 0091H, CF = 1


3.1.2 Use of shifts 


Fast multiplication and division are the most common uses of shift
operations. Recall that in the decimal system, multiplication and division
by a power of ten are simple: just shift digits. The same is true for powers
of two in binary. For example, to double the binary number 1011₂ (or 11
in decimal), shift once to the left to get 10110₂ (or 22). The quotient of a
division by a power of two is the result of a right shift. To divide by just 2,
use a single right shift; to divide by 4 (2²), shift right 2 places; to divide by
8 (2³), shift 3 places to the right, etc. Shift instructions are very basic and
are much faster than the corresponding MUL and DIV instructions!

Actually, logical shifts can be used to multiply and divide unsigned
values. They do not work in general for signed values. Consider the 2-byte
value FFFF (signed -1). If it is logically right shifted once, the result is
7FFF which is +32,767! Another type of shift can be used for signed values.
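
The same arithmetic can be tried in C, where << and >> on unsigned values
are logical shifts. The helper names shl16 and shr16 below are hypothetical
and exist only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply an unsigned 16-bit value by 2^k with a logical left shift.
   Bits shifted past bit 15 are lost, as with SHL on a 16-bit register. */
uint16_t shl16(uint16_t x, unsigned k)
{
    assert(k < 16);
    return (uint16_t) ((uint32_t) x << k);
}

/* Divide an unsigned 16-bit value by 2^k with a logical right shift. */
uint16_t shr16(uint16_t x, unsigned k)
{
    assert(k < 16);
    return (uint16_t) (x >> k);
}
```

Here shl16(11, 1) is 22 and shr16(100, 2) is 25, while shr16(0xFFFF, 1)
is 0x7FFF, the value that is wrong if FFFF was meant as the signed
number -1.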


3.1.3 Arithmetic shifts 


These shifts are designed to allow signed numbers to be quickly
multiplied and divided by powers of 2. They ensure that the sign bit is
treated correctly.


SAL Shift Arithmetic Left - This instruction is just a synonym for SHL. It
is assembled into exactly the same machine code as SHL. As long
as the sign bit is not changed by the shift, the result will be correct.


SAR Shift Arithmetic Right - This is a new instruction that does not shift 
the sign bit (i.e. the msb) of its operand. The other bits are shifted 
as normal except that the new bits that enter from the left are copies 
of the sign bit (that is, if the sign bit is 1, the new bits are also 1). 
Thus, if a byte is shifted with this instruction, only the lower 7 bits 
are shifted. As for the other shifts, the last bit shifted out is stored in 
the carry flag. 
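
The behaviour of SAR can be modeled in C. The C standard leaves the result
of >> on a negative signed value up to the compiler, so the sketch below
builds the arithmetic shift by hand from logical shifts. The name sar16 is
made up for the example, and the final conversion back to a signed type
assumes the usual two’s complement behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Models SAR on a 16-bit value: shifts right k places, copying the
   sign bit into the positions vacated on the left. */
int16_t sar16(int16_t x, unsigned k)
{
    assert(k < 16);
    uint16_t bits = (uint16_t) x;
    uint16_t sign = bits & 0x8000u;      /* the msb itself is not shifted */

    while (k-- > 0)
        bits = (uint16_t) ((bits >> 1) | sign);
    return (int16_t) bits;   /* assumes two's complement conversion */
}
```

With this model sar16(-8, 2) is -2, as expected for a division by 4. Note
that sar16(-7, 1) is -4, not -3: an arithmetic right shift rounds toward
negative infinity, whereas IDIV truncates toward zero.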


mov ax, 0C123H 


sal ax, 1 ; ax = 8246H, CF = 1
sal ax, 1 ; ax = 048CH, CF = 1
sar ax, 2 ; ax = 0123H, CF = 0


3.1.4 Rotate shifts


The rotate shift instructions work like logical shifts except that bits lost
off one end of the data are shifted in on the other side. Thus, the data is
treated as if it is a circular structure. The two simplest rotate instructions
are ROL and ROR which make left and right rotations, respectively. Just as
for the other shifts, these shifts leave a copy of the last bit shifted around
in the carry flag.


mov ax, 0C123H
rol ax, 1 ; ax = 8247H, CF = 1
rol ax, 1 ; ax = 048FH, CF = 1
rol ax, 1 ; ax = 091EH, CF = 0
ror ax, 2 ; ax = 8247H, CF = 1
ror ax, 1 ; ax = C123H, CF = 1
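
C has no rotate operator, but a rotation can be built from two logical
shifts and an OR. The following sketch models ROL and ROR for 16-bit
values (the carry flag is not modeled); the names rol16 and ror16 are
assumptions of this example.

```c
#include <assert.h>   /* used by the checks in the note below */
#include <stdint.h>

/* Rotates x left by n bits: bits leaving the top re-enter at the bottom */
uint16_t rol16(uint16_t x, unsigned n)
{
    n &= 15;                 /* rotating by 16 is the same as by 0 */
    if (n == 0)
        return x;
    return (uint16_t) (((uint32_t) x << n) | (x >> (16 - n)));
}

/* Rotates x right by n bits: bits leaving the bottom re-enter at the top */
uint16_t ror16(uint16_t x, unsigned n)
{
    n &= 15;
    if (n == 0)
        return x;
    return (uint16_t) ((x >> n) | ((uint32_t) x << (16 - n)));
}
```

These reproduce the values in the rotate example: assert(rol16(0xC123, 1)
== 0x8247) and assert(ror16(0x091E, 2) == 0x8247) both hold.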


There are two additional rotate instructions that shift the bits in the
data and the carry flag named RCL and RCR. For example, if the AX register
is rotated with these instructions, the 17 bits made up of AX and the carry
flag are rotated.


mov ax, 0C123H
clc       ; clear the carry flag (CF = 0)
rcl ax, 1 ; ax = 8246H, CF = 1
rcl ax, 1 ; ax = 048DH, CF = 1
rcl ax, 1 ; ax = 091BH, CF = 0
rcr ax, 2 ; ax = 8246H, CF = 1
rcr ax, 1 ; ax = C123H, CF = 0


3.1.5 Simple application


Here is a code snippet that counts the number of bits that are “on”
(i.e. 1) in the EAX register.


    mov bl, 0        ; bl will contain the count of ON bits
    mov ecx, 32      ; ecx is the loop counter
count_loop:
    shl eax, 1       ; shift bit into carry flag
    jnc skip_inc     ; if CF == 0, goto skip_inc
    inc bl
skip_inc:
    loop count_loop


The above code destroys the original value of EAX (EAX is zero at the end of
the loop). If one wished to retain the value of EAX, line 4 could be replaced
with rol eax, 1.


3.2 Boolean Bitwise Operations


There are four common boolean operators: AND, OR, XOR and NOT.
A truth table shows the result of each operation for each possible value of
its operands.


3.2.1 The AND operation


The result of the AND of two bits is only 1 if both bits are 1, else the
result is 0 as the truth table in Table 3.1 shows.


X | Y | X AND Y
0 | 0 |    0
0 | 1 |    0
1 | 0 |    0
1 | 1 |    1


Table 3.1: The AND operation


Processors support these operations as instructions that act
independently on all the bits of data in parallel. For example, if the
contents of AL and BL are ANDed together, the basic AND operation is applied
to each of the 8 pairs of corresponding bits in the two registers as
Figure 3.2 shows. Below is a code example:


     1 0 1 0 1 0 1 0
AND  1 1 0 0 1 0 0 1
     ---------------
     1 0 0 0 1 0 0 0


Figure 3.2: ANDing a byte


mov ax, 0C123H
and ax, 82F6H ; ax = 8022H


3.2.2 The OR operation


The inclusive OR of 2 bits is 0 only if both bits are 0, else the result is 
1 as the truth table in Table 3.2 shows. Below is a code example: 


mov ax, 0C123H 
or ax, 0E831H ; ax = E933H


X | Y | X OR Y
0 | 0 |   0
0 | 1 |   1
1 | 0 |   1
1 | 1 |   1


Table 3.2: The OR operation


X | Y | X XOR Y
0 | 0 |    0
0 | 1 |    1
1 | 0 |    1
1 | 1 |    0


Table 3.3: The XOR operation


3.2.3 The XOR operation 


The exclusive OR of 2 bits is 0 if and only if both bits are equal, else the 
result is 1 as the truth table in Table 3.3 shows. Below is a code example: 


mov ax, 0C123H 
xor ax, 0E831H ; ax = 2912H


3.2.4 The NOT operation 


The NOT operation is a unary operation (i.e. it acts on one operand, 
not two like binary operations such as AND). The NOT of a bit is the 
opposite value of the bit as the truth table in Table 3.4 shows. Below is a 
code example: 


mov ax, 0C123H 
not ax ; ax = 3EDCH 


Note that the NOT finds the one’s complement. Unlike the other bitwise 
operations, the NOT instruction does not change any of the bits in the FLAGS 
register. 


3.2.5 The TEST instruction 


The TEST instruction performs an AND operation, but does not store
the result. It only sets the FLAGS register based on what the result would
be (much like how the CMP instruction performs a subtraction but only sets
FLAGS). For example, if the result would be zero, ZF would be set.


X | NOT X
0 |   1
1 |   0


Table 3.4: The NOT operation


Turn on bit i     | OR the number with 2^i (which is the binary
                  | number with just bit i on)
Turn off bit i    | AND the number with the binary number
                  | with only bit i off. This operand is often
                  | called a mask
Complement bit i  | XOR the number with 2^i


Table 3.5: Uses of boolean operations


3.2.6 Uses of bit operations 


Bit operations are very useful for manipulating individual bits of data 
without modifying the other bits. Table 3.5 shows three common uses of 
these operations. Below is some example code, implementing these ideas. 


mov ax, 0C123H 


or ax, 8 ; turn on bit 3, ax = C12BH
and ax, 0FFDFH ; turn off bit 5, ax = C10BH
xor ax, 8000H ; invert bit 15, ax = 410BH
or ax, 0F00H ; turn on nibble, ax = 4F0BH
and ax, 0FFF0H ; turn off nibble, ax = 4F00H
xor ax, 0F00FH ; invert nibbles, ax = BF0FH
xor ax, 0FFFFH ; 1’s complement, ax = 40F0H


The AND operation can also be used to find the remainder of a division
by a power of two. To find the remainder of a division by 2^i, AND the
number with a mask equal to 2^i - 1. This mask will contain ones from bit 0
up to bit i - 1. It is just these bits that contain the remainder. The result
of the AND will keep these bits and zero out the others. Next is a snippet
of code that finds the quotient and remainder of the division of 100 by 16.


mov eax, 100 ; 100 = 64H 
mov ebx, 0000000FH ; mask = 16 - 1 = 15 or F
and ebx, eax ; ebx = remainder = 4 
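
The same computation can be sketched in C for any power of two, with the
quotient taken by a right shift and the remainder by a mask. The function
name divmod_pow2 is hypothetical and only serves this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Divides the unsigned value x by 2^i: the quotient comes from a
   logical right shift, the remainder from ANDing with 2^i - 1. */
void divmod_pow2(uint32_t x, unsigned i, uint32_t *quot, uint32_t *rem)
{
    assert(i < 32);
    uint32_t mask = ((uint32_t) 1 << i) - 1;   /* ones in bits 0 .. i-1 */

    *quot = x >> i;
    *rem  = x & mask;
}
```

For 100 divided by 16 (i = 4), the mask is F, the quotient is 6 and the
remainder is 4, matching the comment in the assembly snippet. This only
works for unsigned values.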


mov bl, 0          ; bl will contain the count of ON bits
mov ecx, 32        ; ecx is the loop counter
count_loop:
shl eax, 1         ; shift bit into carry flag
adc bl, 0          ; add just the carry flag to bl
loop count_loop


Figure 3.3: Counting bits with ADC


Using the CL register it is possible to modify arbitrary bits of data. Next is
an example that sets (turns on) an arbitrary bit in EAX. The number of the
bit to set is stored in BH.


mov cl, bh         ; first build the number to OR with
mov ebx, 1
shl ebx, cl        ; shift left cl times
or  eax, ebx       ; turn on bit


Turning a bit off is just a little harder. 


mov cl, bh ; first build the number to AND with 
mov ebx, 1 

shl ebx, cl ; shift left cl times 

not ebx ; invert bits 

and eax, ebx ; turn off bit 
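For comparison, here is how the two assembly snippets above might look in C. The function names set_bit and clear_bit are assumptions made for this sketch, and the bit number is assumed to lie between 0 and 31.

```c
#include <stdint.h>

/* Hypothetical helper: turn on one bit (bit must be 0..31). */
uint32_t set_bit(uint32_t value, unsigned bit)
{
    uint32_t mask = (uint32_t)1 << bit;   /* like: mov ebx, 1 / shl ebx, cl */
    return value | mask;                  /* like: or eax, ebx */
}

/* Hypothetical helper: turn off one bit (bit must be 0..31). */
uint32_t clear_bit(uint32_t value, unsigned bit)
{
    uint32_t mask = (uint32_t)1 << bit;   /* build the single-bit number */
    return value & ~mask;                 /* like: not ebx / and eax, ebx */
}
```

In both functions the single-bit number is built first by a shift, exactly as in the assembly. Only the final combining operation differs.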


Code to complement an arbitrary bit is left as an exercise for the reader. 
It is not uncommon to see the following puzzling instruction in an 80x86
program:


xor eax, eax ; eax = 0 


A number XOR’ed with itself always results in zero. This instruction is used 
because its machine code is smaller than the corresponding MOV instruction. 


3.3 Avoiding Conditional Branches 


Modern processors use very sophisticated techniques to execute code as 
quickly as possible. One common technique is known as speculative execu- 
tion. This technique uses the parallel processing capabilities of the CPU to 
execute multiple instructions at once. Conditional branches present a prob- 
lem with this idea. The processor, in general, does not know whether the 
branch will be taken or not. If it is taken, a different set of instructions will 
be executed than if it is not taken. Processors try to predict whether the 
branch will be taken. If the prediction is wrong, the processor has wasted
its time executing the wrong code.

One way to avoid this problem is to avoid using conditional branches
when possible. The sample code in 3.1.5 provides a simple example of where
one could do this. In that example, the “on” bits of the EAX register
are counted. It uses a branch to skip the INC instruction. Figure 3.3 shows
how the branch can be removed by using the ADC instruction to add the
carry flag directly.

The SETxx instructions provide a way to remove branches in certain
cases. These instructions set the value of a byte register or memory location
to zero or one based on the state of the FLAGS register. The characters
after SET are the same characters used for conditional branches. If the
corresponding condition of the SETxx is true, the result stored is a one; if
false, a zero is stored. For example,


setz al ; AL = 1 if Z flag is set, else 0 


Using these instructions, one can develop some clever techniques that cal- 
culate values without branches. 

For example, consider the problem of finding the maximum of two values. 
The standard approach to solving this problem would be to use a CMP and use 
a conditional branch to act on which value was larger. The example program 
below shows how the maximum can be found without any branches. 


; file: max.asm

%include "asm_io.inc"

segment .data

message1 db "Enter a number: ", 0
message2 db "Enter another number: ", 0
message3 db "The larger number is: ", 0

segment .bss

input1 resd 1                ; first number entered

segment .text
        global _asm_main
_asm_main:
        enter 0,0            ; setup routine
        pusha

        mov eax, message1    ; print out first message
        call print_string
        call read_int        ; input first number
        mov [input1], eax

        mov eax, message2    ; print out second message
        call print_string
        call read_int        ; input second number (in eax)

        xor ebx, ebx         ; ebx = 0
        cmp eax, [input1]    ; compare second and first number
        setg bl              ; ebx = (input2 > input1) ? 1 : 0
        neg ebx              ; ebx = (input2 > input1) ? 0xFFFFFFFF : 0
        mov ecx, ebx         ; ecx = (input2 > input1) ? 0xFFFFFFFF : 0
        and ecx, eax         ; ecx = (input2 > input1) ? input2 : 0
        not ebx              ; ebx = (input2 > input1) ? 0 : 0xFFFFFFFF
        and ebx, [input1]    ; ebx = (input2 > input1) ? 0 : input1
        or ecx, ebx          ; ecx = (input2 > input1) ? input2 : input1

        mov eax, message3    ; print out result
        call print_string
        mov eax, ecx
        call print_int
        call print_nl

        popa
        mov eax, 0           ; return back to C
        leave
        ret
The trick is to create a bit mask that can be used to select the correct
value for the maximum. The SETG instruction sets BL to 1 if the
second input is the maximum or 0 otherwise. This is not quite the bit mask
desired. To create the required bit mask, the NEG instruction is then applied
to the entire EBX register. (Note that EBX was zeroed out earlier.) If
EBX is 0, this does nothing; however, if EBX is 1, the result is the two’s
complement representation of -1 or 0xFFFFFFFF. This is just the bit mask
required. The remaining code uses this bit mask to select the correct input
as the maximum.


An alternative trick is to use the DEC instruction. In the above code, if the
NEG is replaced with a DEC, again the result will either be 0 or 0xFFFFFFFF.
However, the values are reversed compared to using the NEG instruction.
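The mask-selection idea can also be written in C. The sketch below is an illustration only: max_branchless is a hypothetical name, the conversion of the result back to a signed type assumes the usual two's complement behavior, and a C compiler is free to generate a branch (or a better branch-free sequence) for it.

```c
#include <stdint.h>

/* Hypothetical helper: maximum of two signed values using a bit mask. */
int32_t max_branchless(int32_t a, int32_t b)
{
    /* (b > a) is 1 or 0, like SETG; subtracting it from 0 gives
       0xFFFFFFFF or 0, like NEG on the zero-extended result. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(b > a);
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;

    /* keep b where the mask is all ones, otherwise keep a */
    return (int32_t)((ub & mask) | (ua & ~mask));
}
```

Here a plays the role of the first input and b the second: the AND, NOT, AND, OR sequence is the same selection the assembly performs.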


3.4 Manipulating bits in C


3.4.1 The bitwise operators of C 


Unlike some high-level languages, C does provide operators for bitwise
operations. The AND operation is represented by the binary & operator¹.
The OR operation is represented by the binary | operator. The XOR
operation is represented by the binary ^ operator. And the NOT operation is
represented by the unary ~ operator.

The shift operations are performed by C’s << and >> binary operators. 
The << operator performs left shifts and the >> operator performs right 
shifts. These operators take two operands. The left operand is the value to 
shift and the right operand is the number of bits to shift by. If the value 
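to shift is an unsigned type, a logical shift is made. If the value is a signed
type (like int), then an arithmetic shift is typically used (strictly, the C
standard leaves the result of right shifting a negative value up to the
implementation). Below is some example C code using these operators: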


short int s;          /* assume that short int is 16-bit */
short unsigned u;

s = -1;               /* s = 0xFFFF (2's complement) */
u = 100;              /* u = 0x0064 */
u = u | 0x0100;       /* u = 0x0164 */
s = s & 0xFFF0;       /* s = 0xFFF0 */
s = s ^ u;            /* s = 0xFE94 */
u = u << 3;           /* u = 0x0B20 (logical shift) */
s = s >> 2;           /* s = 0xFFA5 (arithmetic shift) */


3.4.2 Using bitwise operators in C 


The bitwise operators are used in C for the same purposes as they are 
used in assembly language. They allow one to manipulate individual bits of 
data and can be used for fast multiplication and division. In fact, a smart 
C compiler will use a shift for a multiplication like, x *= 2, automatically. 

Many operating system APIs² (such as POSIX³ and Win32) contain
functions which use operands that have data encoded as bits. For example,
POSIX systems maintain file permissions for three different types of users:
user (a better name would be owner), group and others. Each type of
user can be granted permission to read, write and/or execute a file. To
change the permissions of a file requires the C programmer to manipulate
individual bits. POSIX defines several macros to help (see Table 3.6).


1 This operator is different from the binary && and unary & operators!

2 Application Programming Interface

3 Stands for Portable Operating System Interface for Computer Environments. A
standard developed by the IEEE based on UNIX.


Macro   | Meaning
S_IRUSR | user can read
S_IWUSR | user can write
S_IXUSR | user can execute
S_IRGRP | group can read
S_IWGRP | group can write
S_IXGRP | group can execute
S_IROTH | others can read
S_IWOTH | others can write
S_IXOTH | others can execute


Table 3.6: POSIX File Permission Macros


The chmod function can be used to set the permissions of a file. This function takes
two parameters, a string with the name of the file to act on and an integer⁴
with the appropriate bits set for the desired permissions. For example, the
code below sets the permissions to allow the owner of the file to read and
write to it, users in the group to read the file and others to have no access.


chmod("foo", S_IRUSR | S_IWUSR | S_IRGRP);


The POSIX stat function can be used to find out the current permission 
bits for the file. Used with the chmod function, it is possible to modify some 
of the permissions without changing others. Here is an example that removes 
write access to others and adds read access to the owner of the file. The 
other permissions are not altered. 


struct stat file_stats;     /* struct used by stat() */
stat("foo", &file_stats);   /* read file info.
                               file_stats.st_mode holds permission bits */
chmod("foo", (file_stats.st_mode & ~S_IWOTH) | S_IRUSR);
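The fragment above ignores errors. A slightly fuller sketch, assuming a POSIX system, is shown next. The names updated_mode and adjust_permissions are hypothetical, and masking with 07777 to drop the file-type bits before calling chmod is a choice made for this example rather than something the text requires.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical pure helper: compute the new permission bits from a mode. */
mode_t updated_mode(mode_t mode)
{
    mode &= 07777;                                /* keep only the permission bits */
    return (mode & ~(mode_t)S_IWOTH) | S_IRUSR;   /* clear one bit, set another */
}

/* Hypothetical wrapper: returns 0 on success, -1 if stat() or chmod() fails. */
int adjust_permissions(const char *path)
{
    struct stat file_stats;

    if (stat(path, &file_stats) != 0)             /* read the current mode */
        return -1;
    return chmod(path, updated_mode(file_stats.st_mode));
}
```

Separating the bit manipulation from the system calls makes the masking easy to check by itself: a mode of 0666 becomes 0664, and 0022 becomes 0420.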


3.5 Big and Little Endian Representations 


Chapter 1 introduced the concept of big and little endian representations 
of multibyte data. However, the author has found that this subject confuses 
many people. This section covers the topic in more detail. 

The reader will recall that endianness refers to the order in which the
individual bytes (not bits) of a multibyte data element are stored in memory.
Big endian is the most straightforward method. It stores the most
significant byte first, then the next significant byte and so on. In other words,
the big bits are stored first. Little endian stores the bytes in the opposite
order (least significant first). The x86 family of processors uses little endian
representation.


4 Actually a parameter of type mode_t which is a typedef to an integral type.


unsigned short word = 0x1234;   /* assumes sizeof(short) == 2 */
unsigned char *p = (unsigned char *) &word;

if ( p[0] == 0x12 )
    printf("Big Endian Machine\n");
else
    printf("Little Endian Machine\n");


Figure 3.4: How to Determine Endianness


As an example, consider the double word representing 12345678₁₆. In
big endian representation, the bytes would be stored as 12 34 56 78. In little
endian representation, the bytes would be stored as 78 56 34 12.
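To see the byte order on a particular machine, the bytes of a double word can be copied out in memory order. The following is a small sketch. The function name memory_layout is hypothetical, and the caller is assumed to supply a buffer of at least 12 characters.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes the four bytes of value, in the order they
   are stored in memory, into buf as hex text such as "78 56 34 12".
   buf must have room for at least 12 characters. */
void memory_layout(uint32_t value, char *buf)
{
    unsigned char bytes[4];

    memcpy(bytes, &value, sizeof bytes);   /* copy exactly as stored */
    sprintf(buf, "%02X %02X %02X %02X",
            bytes[0], bytes[1], bytes[2], bytes[3]);
}
```

Called with 0x12345678, the buffer should hold 78 56 34 12 on a little endian machine such as the x86, and 12 34 56 78 on a big endian one.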


The reader is probably asking himself right now, why would any sane chip
designer use little endian representation? Were the engineers at Intel
sadists for inflicting this confusing representation on multitudes of
programmers? It would seem that the CPU has to do extra work to store the bytes
backward in memory like this (and to unreverse them when they are read back in
from memory). The answer is that the CPU does not do any extra work to
write and read memory using little endian format. One has to realize that
the CPU is composed of many electronic circuits that simply work on bit
values. The bits (and bytes) are not in any necessary order in the CPU.


Consider the 2-byte AX register. It can be decomposed into the single 
byte registers: AH and AL. There are circuits in the CPU that maintain the 
values of AH and AL. Circuits are not in any order in a CPU. That is, the 
circuits for AH are not before or after the circuits for AL. A mov instruction 
that copies the value of AX to memory copies the value of AL then AH. This 
is not any harder for the CPU to do than storing AH first. 


The same argument applies to the individual bits in a byte. They are not 
really in any order in the circuits of the CPU (or memory for that matter). 
However, since individual bits can not be addressed in the CPU or memory, 
there is no way to know (or care about) what order they seem to be kept 
internally by the CPU. 


The C code in Figure 3.4 shows how the endianness of a CPU can be
determined. The p pointer treats the word variable as a two element
character array. Thus, p[0] evaluates to the first byte of word in memory which
depends on the endianness of the CPU.


unsigned invert_endian( unsigned x )
{
    unsigned invert;
    const unsigned char *xp = (const unsigned char *) &x;
    unsigned char *ip = (unsigned char *) &invert;

    ip[0] = xp[3];   /* reverse the individual bytes */
    ip[1] = xp[2];
    ip[2] = xp[1];
    ip[3] = xp[0];

    return invert;   /* return the bytes reversed */
}


Figure 3.5: invert_endian Function


3.5.1 When to Care About Little and Big Endian 


For typical programming, the endianness of the CPU is not significant. 
The most common time that it is important is when binary data is trans- 
ferred between different computer systems. This is usually either using some 
type of physical data media (such as a disk) or a network. Since ASCII data 
is single byte, endianness is not an issue for it. 

All internal TCP/IP headers store integers in big endian format (called
network byte order). TCP/IP libraries provide C functions for dealing with
endianness issues in a portable way. For example, the htonl() function
converts a double word (or long integer) from host to network format. The
ntohl() function performs the opposite transformation.⁵ For a big endian
system, the two functions just return their input unchanged. This allows
one to write network programs that will compile and run correctly on any
system irrespective of its endianness. For more information about
endianness and network programming, see W. Richard Stevens’s excellent book,
UNIX Network Programming.
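As a brief illustration of these functions, the sketch below stores a double word into a byte buffer in network byte order and reads it back. It assumes a POSIX system where htonl() and ntohl() are declared in arpa/inet.h (other systems use a different header). The helper names put_u32_network and get_u32_network are made up for this example.

```c
#include <arpa/inet.h>   /* htonl(), ntohl() on POSIX systems */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store value into buf in network byte order. */
void put_u32_network(uint32_t value, unsigned char buf[4])
{
    uint32_t net = htonl(value);      /* host order -> big endian */
    memcpy(buf, &net, sizeof net);
}

/* Hypothetical helper: read a network-order double word back into host order. */
uint32_t get_u32_network(const unsigned char buf[4])
{
    uint32_t net;

    memcpy(&net, buf, sizeof net);
    return ntohl(net);                /* big endian -> host order */
}
```

Whatever the endianness of the host, the buffer filled from 0x12345678 holds the bytes 12 34 56 78 in that order, which is what makes the data portable between machines.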

Figure 3.5 shows a C function that inverts the endianness of a double 
word. The 486 processor introduced a new machine instruction named BSWAP 
that reverses the bytes of any 32-bit register. For example, 


bswap edx ; swap bytes of edx 


The instruction can not be used on 16-bit registers. However, the XCHG
instruction can be used to swap the bytes of the 16-bit registers that can be
decomposed into 8-bit registers. For example:


xchg ah,al ; swap bytes of ax


5 Actually, reversing the endianness of an integer simply reverses the bytes; thus,
converting from big to little or little to big is the same operation. So both of these
functions do the same thing.


With the advent of multibyte character sets, like UNICODE, endianness is
important for even text data. UNICODE supports either endianness and has
a mechanism for specifying which endianness is being used to represent the data.


int count_bits( unsigned int data )
{
    int cnt = 0;

    while( data != 0 ) {
        data = data & (data - 1);
        cnt++;
    }
    return cnt;
}


Figure 3.6: Bit Counting: Method One
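The byte swaps performed by BSWAP and by XCHG on AH and AL can also be sketched in C using only shifts and masks. The names swap16 and swap32 are assumptions for this example. Unlike the pointer-based invert_endian function, this version does not look at how the value is laid out in memory.

```c
#include <stdint.h>

/* Hypothetical helper: swap the two bytes of a word, like XCHG AH, AL. */
uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Hypothetical helper: reverse the four bytes of a double word, like BSWAP. */
uint32_t swap32(uint32_t x)
{
    return  (x >> 24)                   /* byte 3 -> byte 0 */
          | ((x >> 8) & 0x0000FF00u)    /* byte 2 -> byte 1 */
          | ((x << 8) & 0x00FF0000u)    /* byte 1 -> byte 2 */
          |  (x << 24);                 /* byte 0 -> byte 3 */
}
```

Applying either function twice gives back the original value, which matches the observation that converting from big to little endian and back is the same operation.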


3.6 Counting Bits 


Earlier a straightforward technique was given for counting the number 
of bits that are “on” in a double word. This section looks at other less direct 
methods of doing this as an exercise using the bit operations discussed in 
this chapter. 


3.6.1 Method one 


The first method is very simple, but not obvious. Figure 3.6 shows the 
code. 

How does this method work? In every iteration of the loop, one bit is 
turned off in data. When all the bits are off (i.e. when data is zero), the 
loop stops. The number of iterations required to make data zero is equal to 
the number of bits that are on in the original value of data.

The statement data = data & (data - 1) is where a bit of data is turned off.
How does this work? Consider
the general form of the binary representation of data and the rightmost 1
in this representation. By definition, every bit after this 1 must be zero.
Now, what will be the binary representation of data - 1? The bits to the
left of the rightmost 1 will be the same as for data, but at the point of the
rightmost 1 the bits will be the complement of the original bits of data. For
example:

data     = xxxxx10000
data - 1 = xxxxx01111

where the x’s are the same for both numbers. When data is AND’ed with
data - 1, the result will zero the rightmost 1 in data and leave all the other
bits unchanged.


static unsigned char byte_bit_count[256];   /* lookup table */

void initialize_count_bits()
{
    int cnt, i, data;

    for( i = 0; i < 256; i++ ) {
        cnt = 0;
        data = i;
        while( data != 0 ) {   /* method one */
            data = data & (data - 1);
            cnt++;
        }
        byte_bit_count[i] = cnt;
    }
}

int count_bits( unsigned int data )
{
    const unsigned char * byte = ( unsigned char *) & data;

    return byte_bit_count[byte[0]] + byte_bit_count[byte[1]] +
           byte_bit_count[byte[2]] + byte_bit_count[byte[3]];
}


Figure 3.7: Method Two
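The bit-clearing step of method one, data & (data - 1), can be isolated and tried on its own. The sketch below does this. Both function names are hypothetical, and the power-of-two test is a small corollary added here for illustration, not something the method itself needs.

```c
/* Hypothetical helper: returns x with its rightmost 1 bit cleared.
   A value of 0 has no 1 bits and stays 0. */
unsigned clear_lowest_set_bit(unsigned x)
{
    return x & (x - 1);
}

/* A value with exactly one bit on becomes zero once that bit is cleared,
   so this also recognizes powers of two. */
int is_power_of_two(unsigned x)
{
    return x != 0 && clear_lowest_set_bit(x) == 0;
}
```

For example, 0x30 (110000 in binary) becomes 0x20 (100000), since only the rightmost 1 is removed.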


3.6.2 Method two 


A lookup table can also be used to count the bits of an arbitrary double 
word. The straightforward approach would be to precompute the number 
of bits for each double word and store this in an array. However, there are 
two related problems with this approach. There are roughly 4 billion double 
word values! This means that the array will be very big and that initializing 
it will also be very time consuming. (In fact, unless one is going to actually 
use the array more than 4 billion times, more time will be taken to initialize 
the array than it would require to just compute the bit counts using method 
one!)

A more realistic method would precompute the bit counts for all possible
byte values and store these into an array. Then the double word can be split
up into four byte values. The bit counts of these four byte values are looked
up from the array and summed to find the bit count of the original double
word. Figure 3.7 shows the code to implement this approach.

The initialize_count_bits function must be called before the first call
to the count_bits function. This function initializes the global byte_bit_count
array. The count_bits function looks at the data variable not as a double
word, but as an array of four bytes. The byte pointer acts as a pointer to
this four byte array. Thus, byte[0] is one of the bytes in data (either the
least significant or the most significant byte depending on if the hardware
is little or big endian, respectively). Of course, one could use a construction
like:


(data >> 24) & 0x000000FF 


to find the most significant byte value and similar ones for the other bytes; 
however, these constructions will be slower than an array reference. 

One last point, a for loop could easily be used to compute the sum in
the return statement of count_bits. But, a for loop would include the overhead of initializing a
loop index, comparing the index after each iteration and incrementing the 
index. Computing the sum as the explicit sum of four values will be faster. 
In fact, a smart compiler would convert the for loop version to the explicit 
sum. This process of reducing or eliminating loop iterations is a compiler 
optimization technique known as loop unrolling. 
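For completeness, here is a sketch of the table lookup written with the shift-and-mask construction instead of a pointer cast, so that it does not depend on the byte order of the machine. The names bit_table, init_bit_table and count_bits_shift are assumptions for this example. The table is also filled with a different shortcut than the one in the figure: the count for i is taken as the count for i / 2 plus the lowest bit of i.

```c
#include <stdint.h>

static unsigned char bit_table[256];   /* bit_table[b] = number of 1 bits in byte b */

/* Must be called once before count_bits_shift() is used. */
void init_bit_table(void)
{
    int i;

    bit_table[0] = 0;
    for (i = 1; i < 256; i++)          /* count(i) = count(i / 2) + lowest bit of i */
        bit_table[i] = (unsigned char)(bit_table[i >> 1] + (i & 1));
}

/* Explicit (unrolled) sum of the four byte counts. */
int count_bits_shift(uint32_t data)
{
    return bit_table[ data        & 0xFF] +
           bit_table[(data >> 8)  & 0xFF] +
           bit_table[(data >> 16) & 0xFF] +
           bit_table[(data >> 24) & 0xFF];
}
```

Whether this is slower or faster than the pointer version depends on the compiler and processor. The sketch is only meant to show the alternative form.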


3.6.3 Method three 


There is yet another clever method of counting the bits that are on in 
data. This method literally adds the one’s and zero’s of the data together. 
This sum must equal the number of one’s in the data. For example, consider 
counting the one’s in a byte stored in a variable named data. The first step 
is to perform the following operation: 


data = (data & 0x55) + ((data >> 1) & 0x55); 


What does this do? The hex constant 0x55 is 01010101 in binary. In the
first operand of the addition, data is AND’ed with this, so the bits at the odd
bit positions are pulled out. The second operand ((data >> 1) & 0x55)
first moves all the bits at the even positions to an odd position and uses
the same mask to pull out these same bits. Now, the first operand contains
the odd bits and the second operand the even bits of data. When these
two operands are added together, the even and odd bits of data are added
together. For example, if data is 10110011₂, then:

    data & 01010101₂                00 | 01 | 00 | 01
+ (data >> 1) & 01010101₂    or   + 01 | 01 | 00 | 01
                                    -----------------
                                    01 | 10 | 00 | 10

The addition on the right shows the actual bits added together. The
bits of the byte are divided into four 2-bit fields to show that actually there
are four independent additions being performed. Since the most these sums
can be is two, there is no possibility that the sum will overflow its field and
corrupt one of the other field’s sums.

Of course, the total number of bits has not been computed yet. However,
the same technique that was used above can be used to compute the
total in a series of similar steps. The next step would be:


data = (data & 0x33) + ((data >> 2) & 0x33);


Continuing the above example (remember that data now is 01100010₂):

    data & 00110011₂                0010 | 0010
+ (data >> 2) & 00110011₂    or   + 0001 | 0000
                                    -----------
                                    0011 | 0010

Now there are two 4-bit fields that are independently added.
The next step is to add these two bit sums together to form the final
result:


data = (data & 0x0F) + ((data >> 4) & 0x0F);


Using the example above (with data equal to 00110010₂):

    data & 00001111₂                00000010
+ (data >> 4) & 00001111₂    or   + 00000011
                                    --------
                                    00000101

Now data is 5 which is the correct result. Figure 3.8 shows an
implementation of this method that counts the bits in a double word. It uses a for
loop to compute the sum. It would be faster to unroll the loop; however, the
loop makes it clearer how the method generalizes to different sizes of data.


int count_bits( unsigned int x )
{
    static unsigned int mask[] = { 0x55555555,
                                   0x33333333,
                                   0x0F0F0F0F,
                                   0x00FF00FF,
                                   0x0000FFFF };
    int i;
    int shift;   /* number of positions to shift to right */

    for( i = 0, shift = 1; i < 5; i++, shift *= 2 )
        x = (x & mask[i]) + ( (x >> shift) & mask[i] );
    return x;
}


Figure 3.8: Method 3
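For a 32-bit value, the unrolled form of method three mentioned above might look like the following sketch. The name count_bits_unrolled is an assumption for this example. Each line is one iteration of the loop with its mask and shift written out.

```c
#include <stdint.h>

/* Hypothetical unrolled version of method three for a 32-bit value. */
int count_bits_unrolled(uint32_t x)
{
    x = (x & 0x55555555u) + ((x >> 1)  & 0x55555555u);   /* sixteen 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2)  & 0x33333333u);   /* eight 4-bit sums   */
    x = (x & 0x0F0F0F0Fu) + ((x >> 4)  & 0x0F0F0F0Fu);   /* four 8-bit sums    */
    x = (x & 0x00FF00FFu) + ((x >> 8)  & 0x00FF00FFu);   /* two 16-bit sums    */
    x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);   /* one 32-bit sum     */
    return (int)x;
}
```

The masks double in field width at each step while the shift doubles in distance, which is the pattern the loop version expresses with shift *= 2.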


Chapter 4 


Subprograms 


This chapter looks at using subprograms to make modular programs and 
to interface with high level languages (like C). Functions and procedures are 
high level language examples of subprograms. 

The code that calls a subprogram and the subprogram itself must agree 
on how data will be passed between them. These rules on how data will 
be passed are called calling conventions. A large part of this chapter will 
deal with the standard C calling conventions that can be used to interface 
assembly subprograms with C programs. This (and other conventions) often 
pass the addresses of data (i.e. pointers) to allow the subprogram to access 
the data in memory. 


4.1 Indirect Addressing 


Indirect addressing allows registers to act like pointer variables. To in- 
dicate that a register is to be used indirectly as a pointer, it is enclosed in 
square brackets ([]). For example: 


mov ax, [Data] ; normal direct memory addressing of a word 
mov ebx, Data ; ebx = & Data 
mov ax, [ebx] ; ax = *ebx 


Because AX holds a word, line 3 reads a word starting at the address stored 
in EBX. If AX was replaced with AL, only a single byte would be read. 
It is important to realize that registers do not have types like variables do 
in C. What EBX is assumed to point to is completely determined by what 
instructions are used. Furthermore, even the fact that EBX is a pointer is 
completely determined by what instructions are used. If EBX is used
incorrectly, often there will be no assembler error; however, the program
will not work correctly. This is one of the many reasons that assembly
programming is more error prone than high level programming.

All the 32-bit general purpose (EAX, EBX, ECX, EDX) and index (ESI,
EDI) registers can be used for indirect addressing. In general, the 16-bit and
8-bit registers can not be.
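A rough C analogy may help here. In the sketch below an untyped address plays the role of EBX, and it is the function used, not the pointer, that decides how many bytes are read. The names read_word and read_byte are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reads 16 bits starting at address p, like: mov ax, [ebx] */
uint16_t read_word(const void *p)
{
    uint16_t w;

    memcpy(&w, p, sizeof w);    /* memcpy avoids alignment and aliasing problems */
    return w;
}

/* Reads 8 bits starting at address p, like: mov al, [ebx] */
uint8_t read_byte(const void *p)
{
    return *(const uint8_t *)p;
}
```

As with EBX, nothing stops a caller from passing an address that does not really point at a word. The C compiler, like the assembler, will often accept such code without complaint.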


4.2 Simple Subprogram Example 


A subprogram is an independent unit of code that can be used from 
different parts of a program. In other words, a subprogram is like a function 
in C. A jump can be used to invoke the subprogram, but returning presents 
a problem. If the subprogram is to be used by different parts of the program, 
it must return back to the section of code that invoked it. Thus, the jump 
back from the subprogram can not be hard coded to a label. The code below 
shows how this could be done using the indirect form of the JMP instruction. 
This form of the instruction uses the value of a register to determine where 
to jump to (thus, the register acts much like a function pointer in C.) Here
is the first program from chapter 1 rewritten to use a subprogram.


; file: sub1.asm
; Subprogram example program
%include "asm_io.inc"

segment .data
prompt1 db "Enter a number: ", 0        ; don't forget null terminator
prompt2 db "Enter another number: ", 0
outmsg1 db "You entered ", 0
outmsg2 db " and ", 0
outmsg3 db ", the sum of these is ", 0

segment .bss
input1 resd 1
input2 resd 1

segment .text
        global _asm_main
_asm_main:
        enter 0,0             ; setup routine
        pusha

        mov eax, prompt1      ; print out prompt
        call print_string

        mov ebx, input1       ; store address of input1 into ebx
        mov ecx, ret1         ; store return address into ecx
        jmp short get_int     ; read integer
ret1:
        mov eax, prompt2      ; print out prompt
        call print_string

        mov ebx, input2
        mov ecx, $ + 7        ; ecx = this address + 7
        jmp short get_int

        mov eax, [input1]     ; eax = dword at input1
        add eax, [input2]     ; eax += dword at input2
        mov ebx, eax          ; ebx = eax

        mov eax, outmsg1
        call print_string     ; print out first message
        mov eax, [input1]
        call print_int        ; print out input1
        mov eax, outmsg2
        call print_string     ; print out second message
        mov eax, [input2]
        call print_int        ; print out input2
        mov eax, outmsg3
        call print_string     ; print out third message
        mov eax, ebx
        call print_int        ; print out sum (ebx)
        call print_nl         ; print new-line

        popa
        mov eax, 0            ; return back to C
        leave
        ret
; subprogram get_int
; Parameters:
;   ebx - address of dword to store integer into
;   ecx - address of instruction to return to
; Notes:
;   value of eax is destroyed
get_int:
        call read_int
        mov [ebx], eax        ; store input into memory
        jmp ecx               ; jump back to caller


The get_int subprogram uses a simple, register-based calling
convention. It expects the EBX register to hold the address of the DWORD to
store the number input into and the ECX register to hold the code address
of the instruction to jump back to. For the first invocation, the ret1 label is used
to supply this return address. For the second invocation, the $ operator is used to
compute the return address. The $ operator returns the current address for
the line it appears on. The expression $ + 7 computes the address of the
MOV instruction that follows the second jump (mov eax, [input1]), because the
MOV into ECX occupies 5 bytes and the short JMP occupies 2 bytes.

Both of these return address computations are awkward. The first 
method requires a label to be defined for each subprogram call. The second 
method does not require a label, but does require careful thought. If a near 
jump was used instead of a short jump, the number to add to $ would not 
be 7! Fortunately, there is a much simpler way to invoke subprograms. This 
method uses the stack. 


4.3 The Stack 


Many CPUs have built-in support for a stack. A stack is a Last-In First- 
Out (LIFO) list. The stack is an area of memory that is organized in this 
fashion. The PUSH instruction adds data to the stack and the POP instruction 
removes data. The data removed is always the last data added (that is why 
it is called a last-in first-out list). 

The SS segment register specifies the segment that contains the stack 
(usually this is the same segment data is stored into). The ESP register 
contains the address of the data that would be removed from the stack. 
This data is said to be at the top of the stack. Data can only be added in 
double word units. That is, one can not push a single byte on the stack. 

The PUSH instruction inserts a double word¹ on the stack by subtracting
4 from ESP and then storing the double word at [ESP]. The POP instruction
reads the double word at [ESP] and then adds 4 to ESP. The code below
demonstrates how these instructions work and assumes that ESP is initially
1000H.


push dword 1 ; 1 stored at 0FFCh, ESP = 0FFCh
push dword 2 ; 2 stored at 0FF8h, ESP = 0FF8h
push dword 3 ; 3 stored at 0FF4h, ESP = 0FF4h
pop eax      ; EAX = 3, ESP = 0FF8h
pop ebx      ; EBX = 2, ESP = 0FFCh
pop ecx      ; ECX = 1, ESP = 1000h
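The behavior of PUSH and POP can be imitated with a toy model in C. This is only a sketch: the toy_stack type and its functions are invented for this example, the "memory" is a 4 KB array, and no checks are made for pushing too much or popping from an empty stack.

```c
#include <stdint.h>
#include <string.h>

/* Toy model: a small block of memory plus an ESP-like offset into it. */
typedef struct {
    unsigned char memory[0x1000];
    uint32_t esp;                            /* offset of the top of the stack */
} toy_stack;

void toy_init(toy_stack *s)
{
    s->esp = 0x1000;                         /* start just past the end, as in the text */
}

void toy_push(toy_stack *s, uint32_t value)
{
    s->esp -= 4;                             /* first subtract 4 from ESP */
    memcpy(&s->memory[s->esp], &value, 4);   /* then store the dword at [ESP] */
}

uint32_t toy_pop(toy_stack *s)
{
    uint32_t value;

    memcpy(&value, &s->memory[s->esp], 4);   /* first read the dword at [ESP] */
    s->esp += 4;                             /* then add 4 to ESP */
    return value;
}
```

Pushing 1, 2 and 3 and then popping three times returns 3, 2 and 1, with the esp field passing through the same values (0FFCh, 0FF8h, 0FF4h and back up to 1000h) as in the assembly example.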


1 Actually words can be pushed too, but in 32-bit protected mode, it is better to work
with only double words on the stack.

The stack can be used as a convenient place to store data temporarily.
It is also used for making subprogram calls, passing parameters and holding
local variables.

The 80x86 also provides a PUSHA instruction that pushes the values of 
EAX, EBX, ECX, EDX, ESI, EDI and EBP registers (not in this order). 
The POPA instruction can be used to pop them all back off. 


4.4 The CALL and RET Instructions 


The 80x86 provides two instructions that use the stack to make calling 
subprograms quick and easy. The CALL instruction makes an uncondi- 
tional jump to a subprogram and pushes the address of the next instruction 
on the stack. The RET instruction pops off an address and jumps to that 
address. When using these instructions, it is very important that one man- 
age the stack correctly so that the right number is popped off by the RET 
instruction! 

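The previous program can be rewritten to use these new instructions by
changing the code from the first mov ebx, input1 through the second
jmp short get_int to be:


mov ebx, input1
call get_int

mov eax, prompt2     ; print out prompt
call print_string

mov ebx, input2
call get_int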


and change the subprogram get_int to: 


get_int: 
call read_int 
mov [ebx], eax 
ret 


There are several advantages to CALL and RET:

• It is simpler!

• It allows subprogram calls to be nested easily. Notice that get_int
  calls read_int. This call pushes another address on the stack. At the
  end of read_int’s code is a RET that pops off the return address and
  jumps back to get_int’s code. Then when get_int’s RET is executed,
  it pops off the return address that jumps back to asm_main. This works
  correctly because of the LIFO property of the stack.

Remember it is very important to pop off all data that is pushed on the
stack. For example, consider the following:


get_int: 
call read_int 
mov [ebx], eax 
push eax 
ret ; pops off EAX value, not return address! 


This code would not return correctly! 


4.5 Calling Conventions 


When a subprogram is invoked, the calling code and the subprogram (the 
callee) must agree on how to pass data between them. High-level languages 
have standard ways to pass data known as calling conventions. For high-level 
code to interface with assembly language, the assembly language code must 
use the same conventions as the high-level language. The calling conventions 
can differ from compiler to compiler or may vary depending on how the code 
is compiled (e.g. if optimizations are on or not). One universal convention 
is that the code will be invoked with a CALL instruction and return via a 
RET. 

All PC C compilers support one calling convention that will be described 
in the rest of this chapter in stages. These conventions allow one to create 
subprograms that are reentrant. A reentrant subprogram may be called at 
any point of a program safely (even inside the subprogram itself). 


4.5.1 Passing parameters on the stack 


Parameters to a subprogram may be passed on the stack. They are 
pushed onto the stack before the CALL instruction. Just as in C, if the 
parameter is to be changed by the subprogram, the address of the data 
must be passed, not the value. If the parameter’s size is less than a double 
word, it must be converted to a double word before being pushed. 
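The C behavior referred to here can be shown with two tiny functions. The names increment_copy and increment_in_place are made up for this illustration.

```c
/* Receives a copy of the caller's value; the caller's variable is unchanged. */
void increment_copy(int n)
{
    n = n + 1;          /* only the local copy changes */
}

/* Receives the address of the caller's variable and changes it through the pointer. */
void increment_in_place(int *n)
{
    *n = *n + 1;        /* the caller's variable changes */
}
```

If x is 5, calling increment_copy(x) leaves x at 5, while increment_in_place(&x) makes it 6. An assembly subprogram that must modify a parameter needs the address pushed on the stack in the same way.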

The parameters on the stack are not popped off by the subprogram, 
instead they are accessed from the stack itself. Why? 


• Since they have to be pushed on the stack before the CALL instruction,
  the return address would have to be popped off first (and then pushed
  back on again).

• Often the parameters will have to be used in several places in the
  subprogram. Usually, they can not be kept in a register for the entire
  subprogram and would have to be stored in memory. Leaving them
  on the stack keeps a copy of the data in memory that can be accessed
  at any point of the subprogram.


ESP + 4 | Parameter
ESP     | Return address


Figure 4.1:


ESP + 8 | Parameter
ESP + 4 | Return address
ESP     | subprogram data


Figure 4.2:


Consider a subprogram that is passed a single parameter on the stack.
When the subprogram is invoked, the stack looks like Figure 4.1. The
parameter can be accessed using indirect addressing ([ESP+4]²).

When using indirect addressing, the 80x86 processor accesses different
segments depending on what registers are used in the indirect addressing
expression. ESP (and EBP) use the stack segment while EAX, EBX, ECX and
EDX use the data segment. However, this is usually unimportant for most
protected mode programs, because for them the data and stack segments
are the same.

If the stack is also used inside the subprogram to store data, the number
needed to be added to ESP will change. For example, Figure 4.2 shows what
the stack looks like if a DWORD is pushed onto the stack. Now the parameter
is at ESP + 8, not ESP + 4. Thus, it can be very error prone to use ESP when
referencing parameters. To solve this problem, the 80386 supplies another
register to use: EBP. This register’s only purpose is to reference data on the
stack. The C calling convention mandates that a subprogram first save the
value of EBP on the stack and then set EBP to be equal to ESP. This allows
ESP to change as data is pushed or popped off the stack without modifying
EBP. At the end of the subprogram, the original value of EBP must be
restored (this is why it is saved at the start of the subprogram). Figure 4.3
shows the general form of a subprogram that follows these conventions.

subprogram_label:
      push   ebp              ; Save original EBP value on stack
      mov    ebp, esp         ; new EBP = ESP
; subprogram code
      pop    ebp              ; restore original EBP value
      ret


Figure 4.3: General subprogram form


Lines 2 and 3 in Figure 4.3 make up the general prologue of a subprogram.
Lines 5 and 6 make up the epilogue. Figure 4.4 shows what the stack looks
like immediately after the prologue. Now the parameter can be accessed with
[EBP + 8] at any place in the subprogram without worrying about what
else has been pushed onto the stack by the subprogram.


ESP + 8   EBP + 8   Parameter
ESP + 4   EBP + 4   Return address
ESP       EBP       saved EBP


Figure 4.4:


After the subprogram is over, the parameters that were pushed on the
stack must be removed. The C calling convention specifies that the caller
code must do this. Other conventions are different. For example, the Pascal
calling convention specifies that the subprogram must remove the
parameters. (There is another form of the RET instruction that makes this
easy to do.) Some C compilers support this convention too. The pascal
keyword is used in the prototype and definition of the function to tell the
compiler to use this convention. In fact, the stdcall convention that the MS
Windows API C functions use also works this way. What is the advantage of
this way? It is a little more efficient than the C convention. Why do all C
functions not use this convention, then? In general, C allows a function to
have a varying number of arguments (e.g., the printf and scanf functions).
For these types of functions, the operation to remove the parameters from
the stack will vary from one call of the function to the next. The C
convention allows the instructions to perform this operation to be easily
varied from one call to the next. The Pascal and stdcall conventions make
this operation very difficult. Thus, the Pascal convention (like the Pascal
language) does not allow this type of function. MS Windows can use this
convention since none of its API functions take varying numbers of
arguments.


² It is legal to add a constant to a register when using indirect addressing. More
complicated expressions are possible too. This topic is covered in the next chapter.

Figure 4.5 shows how a subprogram using the C calling convention would
be called. Line 3 removes the parameter from the stack by directly
manipulating the stack pointer. A POP instruction could be used to do this
also, but would require the useless result to be stored in a register.
Actually, for this particular case, many compilers would use a POP ECX
instruction to remove the parameter. The compiler would use a POP instead
of an ADD because the ADD requires more bytes for the instruction. However,
the POP also changes ECX’s value!


      push   dword 1          ; pass 1 as parameter
      call   fun
      add    esp, 4           ; remove parameter from stack


Figure 4.5: Sample subprogram call


Next is another example program with two subprograms that use the C
calling conventions discussed above. Line 54 (and other lines) shows that
multiple data and text segments may be declared in a single source file.
They will be combined into single data and text segments in the linking
process. Splitting up the data and code into separate segments allows the
data that a subprogram uses to be defined close by the code of the
subprogram.


sub3.asm


%include "asm_io.inc"

segment .data
sum     dd   0

segment .bss
input   resd 1

;
; pseudo-code algorithm
; i = 1;
; sum = 0;
; while( get_int(i, &input), input != 0 ) {
;   sum += input;
;   i++;
; }
; print_sum(sum);
segment .text
        global  _asm_main
_asm_main:
        enter   0,0               ; setup routine
        pusha

        mov     edx, 1            ; edx is ’i’ in pseudo-code
while_loop:
        push    edx               ; save i on stack
        push    dword input       ; push address of input on stack
        call    get_int
        add     esp, 8            ; remove i and &input from stack

        mov     eax, [input]
        cmp     eax, 0
        je      end_while

        add     [sum], eax        ; sum += input
        inc     edx
        jmp     short while_loop

end_while:
        push    dword [sum]       ; push value of sum onto stack
        call    print_sum
        pop     ecx               ; remove [sum] from stack

        popa
        leave
        ret

; subprogram get_int
; Parameters (in order pushed on stack)
;   number of input (at [ebp + 12])
;   address of word to store input into (at [ebp + 8])
; Notes:
;   values of eax and ebx are destroyed
segment .data
prompt  db      ") Enter an integer number (0 to quit): ", 0

segment .text
get_int:
        push    ebp
        mov     ebp, esp

        mov     eax, [ebp + 12]
        call    print_int

        mov     eax, prompt
        call    print_string

        call    read_int
        mov     ebx, [ebp + 8]
        mov     [ebx], eax        ; store input into memory

        pop     ebp
        ret                       ; jump back to caller

; subprogram print_sum
; prints out the sum
; Parameter:
;   sum to print out (at [ebp+8])
; Note: destroys value of eax
;
segment .data
result  db      "The sum is ", 0

segment .text
print_sum:
        push    ebp
        mov     ebp, esp

        mov     eax, result
        call    print_string

        mov     eax, [ebp+8]
        call    print_int
        call    print_nl

        pop     ebp
        ret


sub3.asm


4.5.2 Local variables on the stack 


The stack can be used as a convenient location for local variables. This is 
exactly where C stores normal (or automatic in C lingo) variables. Using the 
stack for variables is important if one wishes subprograms to be reentrant. 
A reentrant subprogram will work if it is invoked at any place, including the 
subprogram itself. In other words, reentrant subprograms can be invoked 
recursively. Using the stack for variables also saves memory. Data not stored 
on the stack is using memory from the beginning of the program until the 
end of the program (C calls these types of variables global or static). Data 
stored on the stack only use memory when the subprogram they are defined 
for is active. 

Local variables are stored right after the saved EBP value in the stack.
They are allocated by subtracting the number of bytes required from ESP
in the prologue of the subprogram. Figure 4.6 shows the new subprogram
skeleton. The EBP register is used to access local variables. Consider the
C function in Figure 4.7. Figure 4.8 shows how the equivalent subprogram
could be written in assembly.


subprogram_label:
      push   ebp                ; Save original EBP value on stack
      mov    ebp, esp           ; new EBP = ESP
      sub    esp, LOCAL_BYTES   ; = # bytes needed by locals
; subprogram code
      mov    esp, ebp           ; deallocate locals
      pop    ebp                ; restore original EBP value
      ret


Figure 4.6: General subprogram form with local variables


void calc_sum( int n, int * sump )
{
  int i, sum = 0;

  for( i=1; i <= n; i++ )
    sum += i;
  *sump = sum;
}


Figure 4.7: C version of sum


calc_sum:
      push   ebp
      mov    ebp, esp
      sub    esp, 4               ; make room for local sum

      mov    dword [ebp - 4], 0   ; sum = 0
      mov    ebx, 1               ; ebx (i) = 1
for_loop:
      cmp    ebx, [ebp+8]         ; is i <= n?
      jnle   end_for

      add    [ebp-4], ebx         ; sum += i
      inc    ebx
      jmp    short for_loop

end_for:
      mov    ebx, [ebp+12]        ; ebx = sump
      mov    eax, [ebp-4]         ; eax = sum
      mov    [ebx], eax           ; *sump = sum;

      mov    esp, ebp
      pop    ebp
      ret


Figure 4.8: Assembly version of sum


Figure 4.9 shows what the stack looks like after the prologue of the
program in Figure 4.8. This section of the stack that contains the
parameters, return information and local variable storage is called a stack
frame. Every invocation of a C function creates a new stack frame on the
stack.

The prologue and epilogue of a subprogram can be simplified by using
two special instructions that are designed specifically for this purpose. The
ENTER instruction performs the prologue code and the LEAVE performs the
epilogue. The ENTER instruction takes two immediate operands. For the C
calling convention, the second operand is always 0. The first operand is the
number of bytes needed by local variables. The LEAVE instruction has no
operands. Figure 4.10 shows how these instructions are used. Note that the
program skeleton (Figure 1.7) also uses ENTER and LEAVE.

Despite the fact that ENTER and LEAVE simplify the prologue and epilogue,
they are not used very often. Why? Because they are slower than the
equivalent simpler instructions! This is an example of when one can not
assume that a one instruction sequence is faster than a multiple
instruction one.


4.6 Multi-Module Programs 


A multi-module program is one composed of more than one object file. 
All the programs presented here have been multi-module programs. They 
consisted of the C driver object file and the assembly object file (plus the 
C library object files). Recall that the linker combines the object files into 
a single executable program. The linker must match up references made 
to each label in one module (i.e. object file) to its definition in another 
module. In order for module A to use a label defined in module B, the 
extern directive must be used. After the extern directive comes a comma 
delimited list of labels. The directive tells the assembler to treat these 
labels as external to the module. That is, these are labels that can be used 
in this module, but are defined in another. The asm_io.inc file defines the 
read_int, etc. routines as external. 

ESP + 16   EBP + 12   sump
ESP + 12   EBP + 8    n
ESP + 8    EBP + 4    Return address
ESP + 4    EBP        saved EBP
ESP        EBP - 4    sum


Figure 4.9:


subprogram_label:
      enter  LOCAL_BYTES, 0   ; = # bytes needed by locals
; subprogram code
      leave
      ret


Figure 4.10: General subprogram form with local variables using ENTER and
LEAVE


In assembly, labels can not be accessed externally by default. If a label
is to be accessed from modules other than the one it is defined in, it must
be declared global in its module. The global directive does this. Line 13
of the skeleton program listing in Figure 1.7 shows the _asm_main label
being defined as global. Without this declaration, there would be a linker
error. Why? Because the C code would not be able to refer to the internal
_asm_main label.

Next is the code for the previous example, rewritten to use two modules.
The two subprograms (get_int and print_sum) are in a separate source file
from the _asm_main routine.


main4.asm


%include "asm_io.inc"

segment .data
sum     dd   0

segment .bss
input   resd 1

segment .text
        global  _asm_main
        extern  get_int, print_sum
_asm_main:
        enter   0,0               ; setup routine
        pusha

        mov     edx, 1            ; edx is ’i’ in pseudo-code
while_loop:
        push    edx               ; save i on stack
        push    dword input       ; push address of input on stack
        call    get_int
        add     esp, 8            ; remove i and &input from stack

        mov     eax, [input]
        cmp     eax, 0
        je      end_while

        add     [sum], eax        ; sum += input
        inc     edx
        jmp     short while_loop

end_while:
        push    dword [sum]       ; push value of sum onto stack
        call    print_sum
        pop     ecx               ; remove [sum] from stack

        popa
        leave
        ret


main4.asm


sub4.asm


%include "asm_io.inc"

segment .data
prompt  db      ") Enter an integer number (0 to quit): ", 0

segment .text
        global  get_int, print_sum
get_int:
        enter   0,0

        mov     eax, [ebp + 12]
        call    print_int

        mov     eax, prompt
        call    print_string

        call    read_int
        mov     ebx, [ebp + 8]
        mov     [ebx], eax        ; store input into memory

        leave
        ret                       ; jump back to caller

segment .data
result  db      "The sum is ", 0

segment .text
print_sum:
        enter   0,0

        mov     eax, result
        call    print_string

        mov     eax, [ebp+8]
        call    print_int
        call    print_nl

        leave
        ret


sub4.asm


The previous example only has global code labels; however, global data 
labels work exactly the same way. 


4.7 Interfacing Assembly with C 


Today, very few programs are written completely in assembly. Compilers 
are very good at converting high level code into efficient machine code. Since 
it is much easier to write code in a high level language, it is more popular. 
In addition, high level code is much more portable than assembly! 

When assembly is used, it is often only used for small parts of the code.
This can be done in two ways: calling assembly subroutines from C or
inline assembly. Inline assembly allows the programmer to place assembly
statements directly into C code. This can be very convenient; however, there
are disadvantages to inline assembly. The assembly code must be written in
the format the compiler uses. No compiler at the moment supports NASM’s
format. Different compilers require different formats. Borland and Microsoft
require MASM format. DJGPP and Linux’s gcc require GAS³ format. The
technique of calling an assembly subroutine is much more standardized on
the PC.

³ GAS is the assembler that all GNU compilers use. It uses the AT&T syntax which
is very different from the relatively similar syntaxes of MASM, TASM and NASM.

Assembly routines are usually used with C for the following reasons:


• Direct access is needed to hardware features of the computer that are
  difficult or impossible to access from C.


• The routine must be as fast as possible and the programmer can hand
  optimize the code better than the compiler can.


The last reason is not as valid as it once was. Compiler technology has
improved over the years and compilers can often generate very efficient code
(especially if compiler optimizations are turned on). The disadvantages of
assembly routines are reduced portability and readability.

Most of the C calling conventions have already been specified. However,
there are a few additional features that need to be described.


4.7.1 Saving registers


First, C assumes that a subroutine maintains the values of the following
registers: EBX, ESI, EDI, EBP, CS, DS, SS, ES. This does not mean that
the subroutine can not change them internally. Instead, it means that if
it does change their values, it must restore their original values before the
subroutine returns. The EBX, ESI and EDI values must be unmodified
because C uses these registers for register variables. Usually the stack is
used to save the original values of these registers.


segment .data
x       dd      0
format  db      "x = %d\n", 0

segment .text
      push   dword [x]        ; push x’s value
      push   dword format     ; push address of format string
      call   _printf          ; note underscore!
      add    esp, 8           ; remove parameters from stack


Figure 4.11: Call to printf


The register keyword can be used in a C variable declaration to suggest to
the compiler that it use a register for this variable instead of a memory
location. These are known as register variables. Modern compilers do this
automatically without requiring any suggestions.


4.7.2 Labels of functions


Most C compilers prepend a single underscore (_) character at the
beginning of the names of functions and global/static variables. For example,
a function named f will be assigned the label _f. Thus, if this is to be an
assembly routine, it must be labelled _f, not f. The Linux gcc compiler does
not prepend any character. Under Linux ELF executables, one simply would
use the label f for the C function f. However, DJGPP’s gcc does prepend
an underscore. Note that in the assembly skeleton program (Figure 1.7),
the label for the main routine is _asm_main.


4.7.3 Passing parameters


Under the C calling convention, the arguments of a function are pushed
on the stack in the reverse order that they appear in the function call.

Consider the following C statement: printf("x = %d\n",x); Figure 4.11
shows how this would be compiled (shown in the equivalent NASM format).
Figure 4.12 shows what the stack looks like after the prologue inside the
printf function. The printf function is one of the C library functions that
can take any number of arguments. The rules of the C calling conventions
were specifically written to allow these types of functions. Since the address
of the format string is pushed last, its location on the stack will always be at
EBP + 8 no matter how many parameters are passed to the function. The
printf code can then look at the format string to determine how many
parameters should have been passed and look for them on the stack.


EBP + 12   value of x
EBP + 8    address of format string
EBP + 4    Return address
EBP        saved EBP


Figure 4.12: Stack inside printf


It is not necessary to use assembly to process an arbitrary number of
arguments in C. The stdarg.h header file defines macros that can be used to
process them portably. See any good C book for details.

Of course, if a mistake is made, printf("x = %d\n"), the printf code
will still print out the double word value at [EBP + 12]. However, this will
not be x’s value!


4.7.4 Calculating addresses of local variables 


Finding the address of a label defined in the data or bss segments is
simple. Basically, the linker does this. Calculating the address of a local
variable (or parameter) on the stack is not as straightforward, yet this is
a very common need when calling subroutines. Consider the case of passing
the address of a variable (let’s call it x) to a function (let’s call it
foo). If x is located at EBP - 8 on the stack, one cannot just use:


      mov    eax, ebp - 8


Why? The value that MOV stores into EAX must be computed by the
assembler (that is, it must in the end be a constant). However, there is an
instruction that does the desired calculation. It is called LEA (for Load
Effective Address). The following would calculate the address of x and store
it into EAX:


      lea    eax, [ebp - 8]


Now EAX holds the address of x and could be pushed on the stack when
calling function foo. Do not be confused: it looks like this instruction is
reading the data at [EBP-8]; however, this is not true. The LEA instruction
never reads memory! It only computes the address that would be read
by another instruction and stores this address in its first register operand.
Since it does not actually read any memory, no memory size designation
(e.g. dword) is needed or allowed.


4.7.5 Returning values 


Non-void C functions return a value. The C calling conventions
specify how this is done. Return values are passed via registers. All integral
types (char, int, enum, etc.) are returned in the EAX register. If they
are smaller than 32 bits, they are extended to 32 bits when stored in EAX.
(How they are extended depends on if they are signed or unsigned types.)
64-bit values are returned in the EDX:EAX register pair. Pointer values
are also stored in EAX. Floating point values are stored in the ST0 register
of the math coprocessor. (This register is discussed in the floating point
chapter.)


4.7.6 Other calling conventions 


The rules above describe the standard C calling convention that is
supported by all 80x86 C compilers. Often compilers support other calling
conventions as well. When interfacing with assembly language it is very
important to know what calling convention the compiler is using when it
calls your function. Usually, the default is to use the standard calling
convention; however, this is not always the case⁴. Compilers that use
multiple conventions often have command line switches that can be used to
change the default convention. They also provide extensions to the C syntax
to explicitly assign calling conventions to individual functions. However,
these extensions are not standardized and may vary from one compiler to
another.

⁴ The Watcom C compiler is an example of one that does not use the standard
convention by default. See the example source code file for Watcom for details.

The GCC compiler allows different calling conventions. The convention
of a function can be explicitly declared by using the __attribute__
extension. For example, to declare a void function named f that uses the
standard calling convention and takes a single int parameter, use the
following syntax for its prototype:


      void f( int ) __attribute__((cdecl));


GCC also supports the standard call calling convention. The function above
could be declared to use this convention by replacing the cdecl with stdcall.
The difference between stdcall and cdecl is that stdcall requires the
subroutine to remove the parameters from the stack (as the Pascal calling
convention does). Thus, the stdcall convention can only be used with
functions that take a fixed number of arguments (i.e. ones not like printf
and scanf).

GCC also supports an additional attribute called regparm that tells the 
compiler to use registers to pass up to 3 integer arguments to a function 
instead of using the stack. This is a common type of optimization that 
many compilers support. 

Borland and Microsoft use a common syntax to declare calling
conventions. They add the __cdecl and __stdcall keywords to C. These keywords
act as function modifiers and appear immediately before the function name
in a prototype. For example, the function f above would be defined as
follows for Borland and Microsoft:


      void __cdecl f( int );


There are advantages and disadvantages to each of the calling conven- 
tions. The main advantages of the cdecl convention are that it is simple 
and very flexible. It can be used for any type of C function and C compiler. 
Using other conventions can limit the portability of the subroutine. Its main 
disadvantage is that it can be slower than some of the others and use more 
memory (since every invocation of the function requires code to remove the 
parameters on the stack). 

The advantage of the stdcall convention is that it uses less memory than 
cdecl. No stack cleanup is required after the CALL instruction. Its main 
disadvantage is that it can not be used with functions that have variable 
numbers of arguments. 

The advantage of using a convention that uses registers to pass integer
parameters is speed. The main disadvantage is that the convention is more
complex. Some parameters may be in registers and others on the stack.


4.7.7 Examples


Next is an example that shows how an assembly routine can be interfaced
to a C program. (Note that this program does not use the assembly skeleton
program (Figure 1.7) or the driver.c module.)


main5.c


 1  #include <stdio.h>
 2  /* prototype for assembly routine */
 3  void calc_sum( int, int * ) __attribute__((cdecl));
 4
 5  int main( void )
 6  {
 7    int n, sum;
 8
 9    printf("Sum integers up to: ");
10    scanf("%d", &n);
11    calc_sum(n, &sum);
12    printf("Sum is %d\n", sum);
13    return 0;
14  }


main5.c


sub5.asm


; subroutine _calc_sum
; finds the sum of the integers 1 through n
; Parameters:
;   n    - what to sum up to (at [ebp + 8])
;   sump - pointer to int to store sum into (at [ebp + 12])
; pseudo C code:
; void calc_sum( int n, int * sump )
; {
;   int i, sum = 0;
;   for( i=1; i <= n; i++ )
;     sum += i;
;   *sump = sum;
; }

segment .text
        global  _calc_sum
;
; local variable:
;   sum at [ebp-4]
_calc_sum:
        enter   4,0               ; make room for sum on stack
        push    ebx               ; IMPORTANT!

        mov     dword [ebp-4],0   ; sum = 0
        dump_stack 1, 2, 4        ; print out stack from ebp-8 to ebp+16
        mov     ecx, 1            ; ecx is i in pseudocode
for_loop:
        cmp     ecx, [ebp+8]      ; cmp i and n
        jnle    end_for           ; if not i <= n, quit

        add     [ebp-4], ecx      ; sum += i
        inc     ecx
        jmp     short for_loop

end_for:
        mov     ebx, [ebp+12]     ; ebx = sump
        mov     eax, [ebp-4]      ; eax = sum
        mov     [ebx], eax

        pop     ebx               ; restore ebx
        leave
        ret


sub5.asm


Sum integers up to: 10
Stack Dump # 1
EBP = BFFFFB70 ESP = BFFFFB68
 +16  BFFFFB80  080499EC
 +12  BFFFFB7C  BFFFFB80
  +8  BFFFFB78  0000000A
  +4  BFFFFB74  08048501
  +0  BFFFFB70  BFFFFB88
  -4  BFFFFB6C  00000000
  -8  BFFFFB68  4010648C
Sum is 55


Figure 4.13: Sample run of sub5 program

Why is line 22 of sub5.asm so important? Because the C calling
convention requires the value of EBX to be unmodified by the function call.
If this is not done, it is very likely that the program will not work
correctly.

Line 25 demonstrates how the dump_stack macro works. Recall that the
first parameter is just a numeric label, and the second and third parameters
determine how many double words to display below and above EBP
respectively. Figure 4.13 shows an example run of the program. For this dump,
one can see that the address of the dword to store the sum is BFFFFB80 (at
EBP + 12); the number to sum up to is 0000000A (at EBP + 8); the return
address for the routine is 08048501 (at EBP + 4); the saved EBP value is
BFFFFB88 (at EBP); the value of the local variable is 0 (at EBP - 4); and
finally the saved EBX value is 4010648C (at EBP - 8).

The calc_sum function could be rewritten to return the sum as its return
value instead of using a pointer parameter. Since the sum is an integral
value, the sum should be left in the EAX register. Line 11 of the main5.c
file would be changed to:


      sum = calc_sum(n);


Also, the prototype of calc_sum would need to be altered. Below is the
modified assembly code:


sub6.asm


; subroutine _calc_sum
; finds the sum of the integers 1 through n
; Parameters:
;   n    - what to sum up to (at [ebp + 8])
; Return value:
;   value of sum
; pseudo C code:
; int calc_sum( int n )
; {
;   int i, sum = 0;
;   for( i=1; i <= n; i++ )
;     sum += i;
;   return sum;
; }
segment .text
        global  _calc_sum
;
; local variable:
;   sum at [ebp-4]
_calc_sum:
        enter   4,0               ; make room for sum on stack

        mov     dword [ebp-4],0   ; sum = 0
        mov     ecx, 1            ; ecx is i in pseudocode
for_loop:
        cmp     ecx, [ebp+8]      ; cmp i and n
        jnle    end_for           ; if not i <= n, quit

        add     [ebp-4], ecx      ; sum += i
        inc     ecx
        jmp     short for_loop

end_for:
        mov     eax, [ebp-4]      ; eax = sum

        leave
        ret


sub6.asm


segment .data
format  db      "%d", 0

segment .text
        lea     eax, [ebp-16]
        push    eax
        push    dword format
        call    _scanf
        add     esp, 8


Figure 4.14: Calling scanf from assembly


4.7.8 Calling C functions from assembly 


One great advantage of interfacing C and assembly is that it allows
assembly code to access the large C library and user-written functions. For
example, what if one wanted to call the scanf function to read in an integer
from the keyboard? Figure 4.14 shows code to do this. One very important
point to remember is that scanf follows the C calling standard to the letter.
This means that it preserves the values of the EBX, ESI and EDI registers;
however, the EAX, ECX and EDX registers may be modified! In fact, EAX
will definitely be changed, as it will contain the return value of the scanf
call. For other examples of interfacing with C, look at the code in
asm_io.asm which was used to create asm_io.obj.


4.8 Reentrant and Recursive Subprograms 


A reentrant subprogram must satisfy the following properties: 


• It must not modify any code instructions. In a high level language
  this would be difficult, but in assembly it is not hard for a program to
  try to modify its own code. For example:


      mov    word [cs:$+7], 5     ; copy 5 into the word 7 bytes ahead
      add    ax, 2                ; previous statement changes 2 to 5!


  This code would work in real mode, but in protected mode operating
  systems the code segment is marked as read only. When the first line
  above executes, the program will be aborted on these systems. This
  type of programming is bad for many reasons. It is confusing, hard to
  maintain and does not allow code sharing (see below).


• It must not modify global data (such as data in the data and the bss
  segments). All variables are stored on the stack.


There are several advantages to writing reentrant code.


• A reentrant subprogram can be called recursively.


• A reentrant program can be shared by multiple processes. On many
  multi-tasking operating systems, if there are multiple instances of a
  program running, only one copy of the code is in memory. Shared
  libraries and DLL’s (Dynamic Link Libraries) use this idea as well.


• Reentrant subprograms work much better in multi-threaded⁵
  programs. Windows 9x/NT and most UNIX-like operating systems
  (Solaris, Linux, etc.) support multi-threaded programs.


4.8.1 Recursive subprograms 


These types of subprograms call themselves. The recursion can be either 
direct or indirect. Direct recursion occurs when a subprogram, say foo, calls 
itself inside foo’s body. Indirect recursion occurs when a subprogram is not 
called by itself directly, but by another subprogram it calls. For example, 
subprogram foo could call bar and bar could call foo. 


⁵ A multi-threaded program has multiple threads of execution. That is, the program
itself is multi-tasked.


; finds n!
segment .text
      global _fact
_fact:
      enter  0,0

      mov    eax, [ebp+8]     ; eax = n
      cmp    eax, 1
      jbe    term_cond        ; if n <= 1, terminate
      dec    eax
      push   eax
      call   _fact            ; eax = fact(n-1)
      pop    ecx              ; answer in eax
      mul    dword [ebp+8]    ; edx:eax = eax * [ebp+8]
      jmp    short end_fact
term_cond:
      mov    eax, 1
end_fact:
      leave
      ret


Figure 4.15: Recursive factorial function


Recursive subprograms must have a termination condition. When this 
condition is true, no more recursive calls are made. If a recursive routine 
does not have a termination condition or the condition never becomes true, 
the recursion will never end (much like an infinite loop). 


Figure 4.15 shows a function that calculates factorials recursively. It 
could be called from C with: 
      x = fact(3);         /* find 3! */


Figure 4.16 shows what the stack looks like at its deepest point for the above 
function call. 
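For comparison, the same recursion can be sketched in C. This is only an illustration of the logic in Figure 4.15, not code from the book; the name fact_c is hypothetical and was chosen so that it does not clash with the assembly routine.

```c
/* A C sketch of the recursion that Figure 4.15 implements in assembly.
 * fact_c is a hypothetical name used only for this illustration. */
unsigned fact_c(unsigned n)
{
    if (n <= 1)                  /* termination condition (label term_cond) */
        return 1;
    return n * fact_c(n - 1);    /* recursive call, then multiply by n */
}
```

Each call to fact_c gets its own copy of n on the stack, just as each call to _fact finds its own parameter at [ebp+8].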


Figures 4.17 and 4.18 show another, more complicated recursive example
in C and assembly, respectively. What is the output for f(3)? Note
that the ENTER instruction creates a new i on the stack for each recursive
call. Thus, each recursive instance of f has its own independent variable i.
Defining i as a double word in the data segment would not work the same.


n=3 frame    n(3)
             Return address
             Saved EBP
n=2 frame    n(2)
             Return address
             Saved EBP
n=1 frame    n(1)
             Return address
             Saved EBP


Figure 4.16: Stack frames for factorial function 


void f( int x ) 
{ 
int i; 
for( i=0;i < x; i++) { 
printf ("%d\n", i); 
f(i); 
} 
} 


Figure 4.17: Another example (C version) 


4.8.2 Review of C variable storage types 


C provides several types of variable storage. 


global These variables are defined outside of any function and are stored 
at fixed memory locations (in the data or bss segments) and exist 
from the beginning of the program until the end. By default, they can 
be accessed from any function in the program; however, if they are 
declared as static, only the functions in the same module can access 
them (i.e. in assembly terms, the label is internal, not external). 


static These are local variables of a function that are declared static. 
(Unfortunately, C uses the keyword static for two different purposes!) 
These variables are also stored at fixed memory locations (in data or 
bss), but can only be directly accessed in the functions they are defined 
in. 
%define i ebp-4
%define x ebp+8          ; useful macros
segment .data
format       db "%d", 10, 0     ; 10 = '\n'
segment .text
      global _f
      extern _printf
_f:
      enter  4,0           ; allocate room on stack for i

      mov    dword [i], 0
lp:
      mov    eax, [i]
      cmp    eax, [x]
      jnl    quit

      push   eax           ; call printf
      push   format
      call   _printf
      add    esp, 8

      push   dword [i]     ; call f
      call   _f
      pop    eax

      inc    dword [i]     ; i++
      jmp    short lp
quit:
      leave
      ret


Figure 4.18: Another example (assembly version) 
automatic This is the default type for a C variable defined inside a function.
These variables are allocated on the stack when the function
they are defined in is invoked and are deallocated when the function
returns. Thus, they do not have fixed memory locations.


register This keyword asks the compiler to use a register for the data in
this variable. This is just a request. The compiler does not have to
honor it. If the address of the variable is used anywhere in the program,
it will not be honored (since registers do not have addresses). Also,
only simple integral types can be register values. Structured types
cannot be; they would not fit in a register! C compilers will often
automatically make normal automatic variables into register variables
without any hint from the programmer.


volatile This keyword tells the compiler that the value of the variable may
change at any moment. This means that the compiler cannot make any
assumptions about when the variable is modified. Often a compiler
might store the value of a variable in a register temporarily and use
the register in place of the variable in a section of code. It cannot
do these types of optimizations with volatile variables. A common
example of a volatile variable would be one that could be altered by two
threads of a multi-threaded program. Consider the following code:

1     x = 10;
2     y = 20;
3     z = x;

If x could be altered by another thread, it is possible that the other
thread changes x between lines 1 and 3 so that z would not be 10.
However, if x was not declared volatile, the compiler might assume
that x is unchanged and set z to 10.


Another use of volatile is to keep the compiler from using a register 
for a variable. 
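The difference between the fixed-location storage types and automatic storage can be seen in a short C sketch. The function and variable names here are hypothetical and exist only for this illustration.

```c
/* Hypothetical names, used only to illustrate the storage types above. */

int call_count = 0;        /* global: fixed location, exists for the whole run */

int next_static(void)
{
    static int n = 0;      /* static local: fixed location, keeps its value
                              between calls, visible only in this function */
    call_count++;
    return ++n;
}

int next_automatic(void)
{
    int n = 0;             /* automatic: a fresh stack slot on every call */
    call_count++;
    return ++n;
}
```

Repeated calls to next_static return 1, 2, 3 and so on, while next_automatic returns 1 every time because its n is created anew on the stack for each call. This is also why a subprogram that relies on a static local is not reentrant.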




Chapter 5 


Arrays 


5.1 Introduction 


An array is a contiguous block of data in memory, organized as a list. Each
element of the list must be the same type and use exactly the same number of
bytes of memory for storage. Because of these properties, arrays allow efficient
access of the data by its position (or index) in the array. The address of any
element can be computed by knowing three facts:


• The address of the first element of the array.
• The number of bytes in each element.
• The index of the element.


It is convenient to consider the index of the first element of the array to 
be zero (just as in C). It is possible to use other values for the first index, 
but it complicates the computations. 
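With the first index taken as zero, the computation is simply: address of first element + index × element size. A minimal C sketch follows; the helper name element_address is an assumption made for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: computes the address of element `index` from the
 * three facts listed above (zero-based indexing is assumed). */
const void *element_address(const void *first, size_t elem_size, size_t index)
{
    /* work in bytes: address = first + index * elem_size */
    return (const unsigned char *)first + index * elem_size;
}
```

For an array of 4-byte integers, element 3 is therefore found 12 bytes past the first element, which is the same address the C expression &a[3] yields.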


5.1.1 Defining arrays 


Defining arrays in the data and bss segments 


To define an initialized array in the data segment, use the normal db, 
dw, etc. directives. NASM also provides a useful directive named TIMES that 
can be used to repeat a statement many times without having to duplicate 
the statements by hand. Figure 5.1 shows several examples of these. 

To define an uninitialized array in the bss segment, use the resb, resw, 
etc. directives. Remember that these directives have an operand that spec- 
ifies how many units of memory to reserve. Figure 5.1 also shows examples 
of these types of definitions. 


segment .data
; define array of 10 double words initialized to 1,2,..,10
a1           dd   1, 2, 3, 4, 5, 6, 7, 8, 9, 10
; define array of 10 words initialized to 0
a2           dw   0, 0, 0, 0, 0, 0, 0, 0, 0, 0
; same as before using TIMES
a3           times 10 dw 0
; define array of bytes with 200 0's and then 100 1's
a4           times 200 db 0
             times 100 db 1

segment .bss
; define an array of 10 uninitialized double words
a5           resd  10
; define an array of 100 uninitialized words
a6           resw  100


Figure 5.1: Defining arrays 


Defining arrays as local variables on the stack 


There is no direct way to define a local array variable on the stack. 
As before, one computes the total bytes required by all local variables, 
including arrays, and subtracts this from ESP (either directly or using the 
ENTER instruction). For example, if a function needed a character variable, 
two double word integers and a 50 element word array, one would need 
1 + 2 × 4 + 50 × 2 = 109 bytes. However, the number subtracted from ESP
should be a multiple of four (112 in this case) to keep ESP on a double word
boundary. One could arrange the variables inside these 109 bytes in several
ways. Figure 5.2 shows two possible ways. The unused part of the first
ordering is there to keep the double words on double word boundaries to 
speed up memory accesses. 
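The arithmetic above can be sketched in C. Both function names are hypothetical; the rounding trick works only because four is a power of two.

```c
/* Hypothetical helper: round a byte count up to the next multiple of four
 * so that ESP stays on a double word boundary. */
unsigned round_up_to_dword(unsigned bytes)
{
    return (bytes + 3u) & ~3u;
}

/* Bytes needed by the locals in the example above:
 * one char, two double words and a 50 element word array. */
unsigned example_locals_size(void)
{
    return 1 + 2 * 4 + 50 * 2;    /* = 109 */
}
```

Here round_up_to_dword(example_locals_size()) gives the 112 that would be subtracted from ESP (or passed as the first operand of ENTER).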


5.1.2 Accessing elements of arrays 


There is no [ ] operator in assembly language as in C. To access an 
element of an array, its address must be computed. Consider the following 
two array definitions: 


array1       db   5, 4, 3, 2, 1     ; array of bytes
array2       dw   5, 4, 3, 2, 1     ; array of words


Here are some examples using these arrays: 
First arrangement            Second arrangement

EBP - 1     char             EBP - 100   word array
            unused           EBP - 104   dword 1
EBP - 8     dword 1          EBP - 108   dword 2
EBP - 12    dword 2          EBP - 109   char
EBP - 112   word array       EBP - 112   unused


Figure 5.2: Arrangements of the stack 


      mov    al, [array1]          ; al = array1[0]
      mov    al, [array1 + 1]      ; al = array1[1]
      mov    [array1 + 3], al      ; array1[3] = al
      mov    ax, [array2]          ; ax = array2[0]
      mov    ax, [array2 + 2]      ; ax = array2[1] (NOT array2[2]!)
      mov    [array2 + 6], ax      ; array2[3] = ax
      mov    ax, [array2 + 1]      ; ax = ??


In line 5, element 1 of the word array is referenced, not element 2. Why? 
Words are two byte units, so to move to the next element of a word array, 
one must move two bytes ahead, not one. Line 7 will read one byte from the 
first element and one from the second. In C, the compiler looks at the type 
of a pointer in determining how many bytes to move in an expression that 
uses pointer arithmetic so that the programmer does not have to. However, 
in assembly, it is up to the programmer to take the size of array elements
into account when moving from element to element.
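What line 7 actually loads can be worked out with a small C model. The sketch below writes array2 out byte by byte in the little-endian order that the 80x86 uses (low byte first) and reads two bytes at an arbitrary byte offset. The names array2_bytes and read_word_at are assumptions made for this illustration.

```c
#include <stdint.h>

/* array2 from the text (dw 5, 4, 3, 2, 1), laid out as the 80x86 stores it:
 * each word occupies two bytes, low byte first. */
static const uint8_t array2_bytes[10] = {
    5, 0,   4, 0,   3, 0,   2, 0,   1, 0
};

/* Model of "mov ax, [array2 + byte_offset]": fetch two consecutive bytes
 * and combine them low byte first. The caller must keep byte_offset <= 8. */
uint16_t read_word_at(unsigned byte_offset)
{
    return (uint16_t)(array2_bytes[byte_offset] |
                      (array2_bytes[byte_offset + 1] << 8));
}
```

In this model read_word_at(2) is 4 (element 1), while read_word_at(1) combines the high byte of element 0 with the low byte of element 1 and gives 0400h, a value that appears nowhere in the array.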

Figure 5.3 shows a code snippet that adds all the elements of array1
in the previous example code. In line 7, AX is added to DX. Why not
AL? First, the two operands of the ADD instruction must be the same size.
Secondly, it would be easy to add up bytes and get a sum that was too big
to fit into a byte. By using DX, sums up to 65,535 are allowed. However, it
is important to realize that AH is being added also. This is why AH is set
to zero¹ in line 3.

Figures 5.4 and 5.5 show two alternative ways to calculate the sum. The
lines in italics replace lines 6 and 7 of Figure 5.3.
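The reason for widening each byte before adding can be seen in C as well. This is a sketch with hypothetical function names: the first function mirrors the 16-bit accumulator of Figure 5.3, the second shows what would happen with an 8-bit accumulator.

```c
#include <stddef.h>
#include <stdint.h>

/* Each unsigned byte is zero-extended to 16 bits before it is added,
 * which is what clearing AH and adding AX to DX achieves. */
uint16_t sum_bytes_16(const uint8_t *a, size_t n)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (uint16_t)(sum + a[i]);
    return sum;
}

/* With only an 8-bit accumulator the sum wraps around modulo 256. */
uint8_t sum_bytes_8(const uint8_t *a, size_t n)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (uint8_t)(sum + a[i]);
    return sum;
}
```

For the bytes 200, 100 and 50 the 16-bit version gives 350, while the 8-bit version wraps to 94.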


¹ Setting AH to zero is implicitly assuming that AL is an unsigned number. If it is
signed, the appropriate action would be to insert a CBW instruction between lines 6 and 7.
      mov    ebx, array1           ; ebx = address of array1
      mov    dx, 0                 ; dx will hold sum
      mov    ah, 0                 ; ?
      mov    ecx, 5
lp:
      mov    al, [ebx]             ; al = *ebx
      add    dx, ax                ; dx += ax (not al!)
      inc    ebx                   ; bx++
      loop   lp


Figure 5.3: Summing elements of an array (Version 1)


      mov    ebx, array1           ; ebx = address of array1
      mov    dx, 0                 ; dx will hold sum
      mov    ecx, 5
lp:
      add    dl, [ebx]             ; dl += *ebx
      jnc    next                  ; if no carry goto next
      inc    dh                    ; inc dh
next:
      inc    ebx                   ; bx++
      loop   lp


Figure 5.4: Summing elements of an array (Version 2) 


5.1.3 More advanced indirect addressing 


Not surprisingly, indirect addressing is often used with arrays. The most 
general form of an indirect memory reference is: 


[ base reg + factor*index reg + constant] 
where: 


base reg is one of the registers EAX, EBX, ECX, EDX, EBP, ESP, ESI or 
EDI. 


factor is either 1, 2, 4 or 8. (If 1, factor is omitted.) 


index reg is one of the registers EAX, EBX, ECX, EDX, EBP, ESI, EDI. 
(Note that ESP is not in list.) 
      mov    ebx, array1           ; ebx = address of array1
      mov    dx, 0                 ; dx will hold sum
      mov    ecx, 5
lp:
      add    dl, [ebx]             ; dl += *ebx
      adc    dh, 0                 ; dh += carry flag + 0
      inc    ebx                   ; bx++
      loop   lp


Figure 5.5: Summing elements of an array (Version 3) 


constant is a 32-bit constant. The constant can be a label (or a label 
expression). 


5.1.4 Example 


Here is an example that uses an array and passes it to a function. It
uses the array1c.c program (listed below) as a driver, not the driver.c
program.


array1.asm


%define ARRAY_SIZE 100
%define NEW_LINE 10

segment .data
FirstMsg        db   "First 10 elements of array", 0
Prompt          db   "Enter index of element to display: ", 0
SecondMsg       db   "Element %d is %d", NEW_LINE, 0
ThirdMsg        db   "Elements 20 through 29 of array", 0
InputFormat     db   "%d", 0

segment .bss
array           resd ARRAY_SIZE

segment .text
        extern  _puts, _printf, _scanf, _dump_line
        global  _asm_main
_asm_main:
        enter   4,0             ; local dword variable at EBP - 4
        push    ebx
        push    esi

; initialize array to 100, 99, 98, 97, ...

        mov     ecx, ARRAY_SIZE
        mov     ebx, array
init_loop:
        mov     [ebx], ecx
        add     ebx, 4
        loop    init_loop

        push    dword FirstMsg          ; print out FirstMsg
        call    _puts
        pop     ecx

        push    dword 10
        push    dword array
        call    _print_array            ; print first 10 elements of array
        add     esp, 8

; prompt user for element index
Prompt_loop:
        push    dword Prompt
        call    _printf
        pop     ecx

        lea     eax, [ebp-4]            ; eax = address of local dword
        push    eax
        push    dword InputFormat
        call    _scanf
        add     esp, 8
        cmp     eax, 1                  ; eax = return value of scanf
        je      InputOK

        call    _dump_line              ; dump rest of line and start over
        jmp     Prompt_loop             ; if input invalid

InputOK:
        mov     esi, [ebp-4]
        push    dword [array + 4*esi]
        push    esi
        push    dword SecondMsg         ; print out value of element
        call    _printf
        add     esp, 12

        push    dword ThirdMsg          ; print out elements 20-29
        call    _puts
        pop     ecx

        push    dword 10
        push    dword array + 20*4      ; address of array[20]
        call    _print_array
        add     esp, 8

        pop     esi
        pop     ebx
        mov     eax, 0                  ; return back to C
        leave
        ret

;
; routine _print_array
; C-callable routine that prints out elements of a double word array as
; signed integers.
; C prototype:
; void print_array( const int * a, int n);
; Parameters:
;   a - pointer to array to print out (at ebp+8 on stack)
;   n - number of integers to print out (at ebp+12 on stack)

segment .data
OutputFormat    db   "%-5d %5d", NEW_LINE, 0

segment .text
        global  _print_array
_print_array:
        enter   0,0
        push    esi
        push    ebx

        xor     esi, esi                ; esi = 0
        mov     ecx, [ebp+12]           ; ecx = n
        mov     ebx, [ebp+8]            ; ebx = address of array
print_loop:
        push    ecx                     ; printf might change ecx!

        push    dword [ebx + 4*esi]     ; push array[esi]
        push    esi
        push    dword OutputFormat
        call    _printf
        add     esp, 12                 ; remove parameters (leave ecx!)

        inc     esi
        pop     ecx
        loop    print_loop

        pop     ebx
        pop     esi
        leave
        ret


array1.asm


array1c.c


#include <stdio.h>

int asm_main( void );
void dump_line( void );

int main()
{
  int ret_status;
  ret_status = asm_main();
  return ret_status;
}

/*
 * function dump_line
 * dumps all chars left in current line from input buffer
 */
void dump_line()
{
  int ch;

  while( (ch = getchar()) != EOF && ch != '\n')
    /* null body */ ;
}


array1c.c
The LEA instruction revisited


The LEA instruction can be used for other purposes than just calculating
addresses. A fairly common one is for fast computations. Consider the
following:


      lea    ebx, [4*eax + eax]


This effectively stores the value of 5 × EAX into EBX. Using LEA to do this
is both easier and faster than using MUL. However, one must realize that the
expression inside the square brackets must be a legal indirect address. Thus,
for example, this instruction cannot be used to multiply by 6 quickly.
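The restriction follows from the addressing form given earlier: the factor must be 1, 2, 4 or 8, and the same register may optionally be added once as the base. A C sketch of this reasoning is shown below; both function names are hypothetical, and the second function only considers a single LEA with one source register and no constant.

```c
#include <stdint.h>

/* Model of "lea ebx, [4*eax + eax]": a shift by two plus one addition.
 * Like LEA, the result wraps around modulo 2^32. */
uint32_t times5(uint32_t x)
{
    return (x << 2) + x;
}

/* Returns 1 if a single LEA of the form [factor*reg] or [factor*reg + reg]
 * can multiply reg by m, and 0 otherwise. */
int lea_can_multiply_by(unsigned m)
{
    static const unsigned factors[4] = { 1, 2, 4, 8 };
    for (unsigned k = 0; k < 4; k++)
        if (m == factors[k] || m == factors[k] + 1)
            return 1;
    return 0;
}
```

Under these assumptions the reachable multipliers are 1, 2, 3, 4, 5, 8 and 9, which is why 6 is not available.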


5.1.5 Multidimensional Arrays 


Multidimensional arrays are not really very different than the plain one 
dimensional arrays already discussed. In fact, they are represented in mem- 
ory as just that, a plain one dimensional array. 


Two Dimensional Arrays 


Not surprisingly, the simplest multidimensional array is a two dimen- 
sional one. A two dimensional array is often displayed as a grid of elements. 
Each element is identified by a pair of indices. By convention, the first index 
is identified with the row of the element and the second index the column. 

Consider an array with three rows and two columns defined as: 


int a [3][2]; 


The C compiler would reserve room for a 6 (= 2 x 3) integer array and map 
the elements as follows: 


Index     0        1        2        3        4        5
Element   a[0][0]  a[0][1]  a[1][0]  a[1][1]  a[2][0]  a[2][1]


What the table attempts to show is that the element referenced as a [0] [0] 
is stored at the beginning of the 6 element one dimensional array. Element 
a[0] [1] is stored in the next position (index 1) and so on. Each row of the 
two dimensional array is stored contiguously in memory. The last element 
of a row is followed by the first element of the next row. This is known 
as the rowwise representation of the array and is how a C/C++ compiler 
would represent the array. 

How does the compiler determine where a[i][j] appears in the rowwise
representation? A simple formula will compute the index from i and j. The
formula in this case is 2i + j. It's not too hard to see how this formula is
derived. Each row is two elements long; so, the first element of row i is
at position 2i. Then the position of column j is found by adding j to 2i.
      mov    eax, [ebp - 44]          ; ebp - 44 is i's location
      sal    eax, 1                   ; multiply i by 2
      add    eax, [ebp - 48]          ; add j
      mov    eax, [ebp + 4*eax - 40]  ; ebp - 40 is the address of a[0][0]
      mov    [ebp - 52], eax          ; store result into x (at ebp - 52)


Figure 5.6: Assembly for x = a[i][j] 


This analysis also shows how the formula is generalized to an array with N
columns: N × i + j. Notice that the formula does not depend on the number
of rows.


As an example, let us see how gcc compiles the following code (using the 
array a defined above): 


      x = a[i][j];


Figure 5.6 shows the assembly this is translated into. Thus, the compiler 
essentially converts the code to: 


      x = *(&a[0][0] + 2*i + j);


and in fact, the programmer could write this way with the same result. 
There is nothing magical about the choice of the rowwise representation 
of the array. A columnwise representation would work just as well: 


Index     0        1        2        3        4        5
Element   a[0][0]  a[1][0]  a[2][0]  a[0][1]  a[1][1]  a[2][1]

In the columnwise representation, each column is stored contiguously.
Element [i][j] is stored at position i + 3j. Other languages (FORTRAN,
for example) use the columnwise representation. This is important when
interfacing code with multiple languages.
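Both index formulas are easy to express in C. The sketch below uses hypothetical function names; get_rowwise repeats by hand the flat indexing that the text describes for the 3 × 2 array a.

```c
#include <stddef.h>

/* Position of element [i][j] when rows are stored contiguously.
 * Only the number of columns is needed. */
size_t rowwise_index(size_t ncols, size_t i, size_t j)
{
    return ncols * i + j;
}

/* Position of element [i][j] when columns are stored contiguously.
 * Only the number of rows is needed. */
size_t columnwise_index(size_t nrows, size_t i, size_t j)
{
    return i + nrows * j;
}

/* x = a[i][j] computed by hand, treating the array as a flat list of ints. */
int get_rowwise(int a[3][2], size_t i, size_t j)
{
    int *flat = &a[0][0];
    return *(flat + rowwise_index(2, i, j));
}
```

For the array above, a[2][1] sits at position 5 in both layouts, but a[1][1] is at position 3 rowwise and position 4 columnwise.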


Dimensions Above Two 


For dimensions above two, the same basic idea is applied. Consider a 
three dimensional array: 


int b [4][3][2]; 


This array would be stored like it was four two dimensional arrays each of 
size [3] [2] consecutively in memory. The table below shows how it starts 
out: 
Index     0           1           2           3           4           5
Element   b[0][0][0]  b[0][0][1]  b[0][1][0]  b[0][1][1]  b[0][2][0]  b[0][2][1]
Index     6           7           8           9           10          11
Element   b[1][0][0]  b[1][0][1]  b[1][1][0]  b[1][1][1]  b[1][2][0]  b[1][2][1]


The formula for computing the position of b[i][j][k] is 6i + 2j + k. The
6 is determined by the size of the [3][2] arrays. In general, for an array
dimensioned as a[L][M][N] the position of element a[i][j][k] will be
M × N × i + N × j + k. Notice again that the first dimension (L) does not
appear in the formula.

For higher dimensions, the same process is generalized. For an n dimensional
array of dimensions $D_1$ to $D_n$, the position of element denoted by the
indices $i_1$ to $i_n$ is given by the formula:

$$D_2 \times D_3 \cdots \times D_n \times i_1 + D_3 \times D_4 \cdots \times D_n \times i_2 + \cdots + D_n \times i_{n-1} + i_n$$

or for the über math geek, it can be written more succinctly as:

$$\sum_{j=1}^{n} \left( \prod_{k=j+1}^{n} D_k \right) i_j$$

The first dimension, $D_1$, does not appear in the formula.
For the columnwise representation, the general formula would be:

$$i_1 + D_1 \times i_2 + \cdots + D_1 \times D_2 \times \cdots \times D_{n-2} \times i_{n-1} + D_1 \times D_2 \times \cdots \times D_{n-1} \times i_n$$

or in über math geek notation:

$$\sum_{j=1}^{n} \left( \prod_{k=1}^{j-1} D_k \right) i_j$$

In this case, it is the last dimension, $D_n$, that does not appear in the formula.
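Both general formulas can be evaluated without computing the products explicitly, by folding the dimensions in one at a time (Horner's scheme). The following C sketch does this; the function names and the convention that dims[0] holds the first dimension are assumptions of this illustration.

```c
#include <stddef.h>

/* Rowwise position: ((i1*D2 + i2)*D3 + i3)... which expands to the first
 * formula. dims[0] is only ever multiplied by zero, so the first dimension
 * has no effect on the result. */
size_t rowwise_position(const size_t *dims, const size_t *idx, size_t n)
{
    size_t pos = 0;
    for (size_t j = 0; j < n; j++)
        pos = pos * dims[j] + idx[j];
    return pos;
}

/* Columnwise position: the same scheme run from the last index to the
 * first, so here the last dimension has no effect. */
size_t columnwise_position(const size_t *dims, const size_t *idx, size_t n)
{
    size_t pos = 0;
    for (size_t j = n; j-- > 0; )
        pos = pos * dims[j] + idx[j];
    return pos;
}
```

For the array b[4][3][2] and the element b[1][2][1], the rowwise position is 6 × 1 + 2 × 2 + 1 = 11 and the columnwise position is 1 + 4 × 2 + 12 × 1 = 21.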


Passing Multidimensional Arrays as Parameters in C 


The rowwise representation of multidimensional arrays has a direct effect 
in C programming. For one dimensional arrays, the size of the array is not 
required to compute where any specific element is located in memory. This is 
not true for multidimensional arrays. To access the elements of these arrays, 
the compiler must know all but the first dimension. This becomes apparent 
when considering the prototype of a function that takes a multidimensional 
array as a parameter. The following will not compile: 


      void f( int a[ ][ ] );   /* no dimension information */


(Margin note: This is where you can tell the author was a physics major.
Or was the reference to FORTRAN a giveaway?)


However, the following does compile:

      void f( int a[ ][2] );


Any two dimensional array with two columns can be passed to this function.
The first dimension is not required².
Do not be confused by a function with this prototype: 


      void f( int * a[ ] );


This defines a single dimensional array of integer pointers (which incidentally
can be used to create an array of arrays that acts much like a two dimensional
array).

For higher dimensional arrays, all but the first dimension of the array 
must be specified for parameters. For example, a four dimensional array 
parameter might be passed like: 


      void f( int a[ ][4][3][2] );
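As a sketch of how such a parameter is used, the function below accepts any two dimensional array with two columns. The function name is hypothetical. Because the first dimension is not part of the parameter type, the number of rows has to be passed separately.

```c
#include <stddef.h>

/* The column count (2) is part of the parameter type, so the compiler can
 * compute 2*i + j for a[i][j]. The row count is an ordinary argument. */
int sum_two_columns(int a[][2], size_t nrows)
{
    int sum = 0;
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < 2; j++)
            sum += a[i][j];
    return sum;
}
```

An array declared as int m[3][2] can be passed as sum_two_columns(m, 3), and so can one declared as int m[1][2] with a row count of 1.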


5.2 Array/String Instructions 


The 80x86 family of processors provides several instructions that are
designed to work with arrays. These instructions are called string instructions.
They use the index registers (ESI and EDI) to perform an operation and
then automatically increment or decrement one or both of the index
registers. The direction flag (DF) in the FLAGS register determines whether the
index registers are incremented or decremented. There are two instructions
that modify the direction flag:


CLD clears the direction flag. In this state, the index registers are incre- 
mented. 


STD sets the direction flag. In this state, the index registers are decre- 
mented. 


A very common mistake in 80x86 programming is to forget to explicitly put 
the direction flag in the correct state. This often leads to code that works 
most of the time (when the direction flag happens to be in the desired state), 
but does not work all the time. 


5.2.1 Reading and writing memory 


The simplest string instructions either read or write memory or both. 
They may read or write a byte, word or double word at a time. Figure 5.7 


² A size can be specified here, but it is ignored by the compiler.
LODSB   AL = [DS:ESI]         STOSB   [ES:EDI] = AL
        ESI = ESI + 1                 EDI = EDI + 1
LODSW   AX = [DS:ESI]         STOSW   [ES:EDI] = AX
        ESI = ESI + 2                 EDI = EDI + 2
LODSD   EAX = [DS:ESI]        STOSD   [ES:EDI] = EAX
        ESI = ESI + 4                 EDI = EDI + 4


Figure 5.7: Reading and writing string instructions 


segment .data
array1  dd  1, 2, 3, 4, 5, 6, 7, 8, 9, 10

segment .bss
array2  resd 10

segment .text
      cld                   ; don't forget this!
      mov    esi, array1
      mov    edi, array2
      mov    ecx, 10
lp:
      lodsd
      stosd
      loop   lp


Figure 5.8: Load and store example 


shows these instructions with a short pseudo-code description of what they
do. There are several points to notice here. First, ESI is used for reading and
EDI for writing. It is easy to remember this if one remembers that SI stands
for Source Index and DI stands for Destination Index. Next, notice that the
register that holds the data is fixed (either AL, AX or EAX). Finally, note
that the storing instructions use ES to determine the segment to write to,
not DS. In protected mode programming this is not usually a problem, since
there is only one data segment and ES should be automatically initialized
to reference it (just as DS is). However, in real mode programming, it is
very important for the programmer to initialize ES to the correct segment
selector value³. Figure 5.8 shows an example use of these instructions that


³ Another complication is that one can not copy the value of the DS register into the ES
register directly using a single MOV instruction. Instead, the value of DS must be copied to
a general purpose register (like AX) and then copied from that register to ES using two
MOVSB   byte [ES:EDI] = byte [DS:ESI]
        ESI = ESI + 1
        EDI = EDI + 1
MOVSW   word [ES:EDI] = word [DS:ESI]
        ESI = ESI + 2
        EDI = EDI + 2
MOVSD   dword [ES:EDI] = dword [DS:ESI]
        ESI = ESI + 4
        EDI = EDI + 4


Figure 5.9: Memory move string instructions 


segment .bss
array   resd 10

segment .text
      cld                   ; don't forget this!
      mov    edi, array
      mov    ecx, 10
      xor    eax, eax
      rep stosd


Figure 5.10: Zero array example 


copies an array into another. 

The combination of a LODSx and STOSx instruction (as in lines 13 and 14 
of Figure 5.8) is very common. In fact, this combination can be performed 
by a single MOVSx string instruction. Figure 5.9 describes the operations that 
these instructions perform. Lines 13 and 14 of Figure 5.8 could be replaced 
with a single MOVSD instruction with the same effect. The only difference 
would be that the EAX register would not be used at all in the loop. 


5.2.2 The REP instruction prefix 


The 80x86 family provides a special instruction prefix⁴ called REP that
can be used with the above string instructions. This prefix tells the CPU
to repeat the next string instruction a specified number of times. The ECX


MOV instructions. 

⁴ An instruction prefix is not an instruction; it is a special byte that is placed before a
string instruction and modifies that instruction's behavior. Other prefixes are also used to
override segment defaults of memory accesses.
CMPSB   compares byte [DS:ESI] and byte [ES:EDI]
        ESI = ESI + 1
        EDI = EDI + 1
CMPSW   compares word [DS:ESI] and word [ES:EDI]
        ESI = ESI + 2
        EDI = EDI + 2
CMPSD   compares dword [DS:ESI] and dword [ES:EDI]
        ESI = ESI + 4
        EDI = EDI + 4
SCASB   compares AL and [ES:EDI]
        EDI + 1
SCASW   compares AX and [ES:EDI]
        EDI + 2
SCASD   compares EAX and [ES:EDI]
        EDI + 4


Figure 5.11: Comparison string instructions 


register is used to count the iterations (just as for the LOOP instruction). 
Using the REP prefix, the loop in Figure 5.8 (lines 12 to 15) could be replaced 
with a single line: 


rep movsd 


Figure 5.10 shows another example that zeroes out the contents of an array. 


5.2.3 Comparison string instructions 


Figure 5.11 shows several new string instructions that can be used to 
compare memory with other memory or a register. They are useful for 
comparing or searching arrays. They set the FLAGS register just like the 
CMP instruction. The CMPSx instructions compare corresponding memory 
locations and the SCASx scan memory locations for a specific value. 

Figure 5.12 shows a short code snippet that searches for the number 12 
in a double word array. The SCASD instruction on line 10 always adds 4 to 
EDI, even if the value searched for is found. Thus, if one wishes to find the 
address of the 12 found in the array, it is necessary to subtract 4 from EDI 
(as line 16 does). 


5.2.4 The REPx instruction prefixes 


There are several other REP-like instruction prefixes that can be used 
with the comparison string instructions. Figure 5.13 shows the two new 
 1  segment .bss
 2  array        resd 100
 3
 4  segment .text
 5        cld
 6        mov    edi, array    ; pointer to start of array
 7        mov    ecx, 100      ; number of elements
 8        mov    eax, 12       ; number to scan for
 9  lp:
10        scasd
11        je     found
12        loop   lp
13   ; code to perform if not found
14        jmp    onward
15  found:
16        sub    edi, 4        ; edi now points to 12 in array
17   ; code to perform if found
18  onward:


(Margin note: Why can one not just look to see if ECX is zero after
the repeated comparison?)


Figure 5.12: Search example 


REPE, REPZ      repeats instruction while Z flag is set or at most ECX times
REPNE, REPNZ    repeats instruction while Z flag is cleared or at most ECX times


Figure 5.13: REPx instruction prefixes 


prefixes and describes their operation. REPE and REPZ are just synonyms 
for the same prefix (as are REPNE and REPNZ). If the repeated comparison 
string instruction stops because of the result of the comparison, the index 
register or registers are still incremented and ECX decremented; however, 
the FLAGS register still holds the state that terminated the repetition. 
Thus, it is possible to use the Z flag to determine if the repeated comparisons 
stopped because of a comparison or ECX becoming zero. 


Figure 5.14 shows an example code snippet that determines if two blocks 
of memory are equal. The JE on line 7 of the example checks to see the result 
of the previous instruction. If the repeated comparison stopped because it 
found two unequal bytes, the Z flag will still be cleared and no branch is 
made; however, if the comparisons stopped because ECX became zero, the 
Z flag will still be set and the code branches to the equal label. 
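A small C model of REPE CMPSB shows why the Z flag, and not ECX, has to be tested afterwards. This is a simplified sketch: the struct and function names are hypothetical, the direction flag is assumed to be clear, and the model reports zf = 1 when the count is zero on entry, whereas the real instruction would leave the flags unchanged in that case.

```c
#include <stddef.h>
#include <stdint.h>

/* State of interest after the repeated comparison (hypothetical type). */
struct repe_result {
    size_t ecx;   /* count remaining when the repetition stopped */
    int    zf;    /* 1 if the last comparison found equal bytes */
};

/* Model of "repe cmpsb" with the direction flag clear. */
struct repe_result model_repe_cmpsb(const uint8_t *esi, const uint8_t *edi,
                                    size_t ecx)
{
    struct repe_result r;
    r.zf = 1;                    /* simplification for ecx == 0, see above */
    while (ecx != 0) {
        r.zf = (*esi == *edi);   /* compare, setting the Z flag */
        esi++;                   /* index registers always advance */
        edi++;
        ecx--;                   /* and ECX is always decremented */
        if (!r.zf)
            break;               /* stop on the first mismatch */
    }
    r.ecx = ecx;
    return r;
}
```

If two blocks differ only in their last byte, the model stops with ecx equal to 0 and zf equal to 0. ECX is also 0 when the blocks are identical, so only the Z flag distinguishes the two cases.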
segment .text
      cld
      mov    esi, block1        ; address of first block
      mov    edi, block2        ; address of second block
      mov    ecx, size          ; size of blocks in bytes
      repe   cmpsb              ; repeat while Z flag is set
      je     equal              ; if Z set, blocks equal
   ; code to perform if blocks are not equal
      jmp    onward
equal:
   ; code to perform if equal
onward:


Figure 5.14: Comparing memory blocks 


5.2.5 Example 


This section contains an assembly source file with several functions that 
implement array operations using string instructions. Many of the functions 
duplicate familiar C library functions. 


memory.asm 


global _asm_copy, _asm_find, _asm_strlen, _asm_strcpy 


segment .text 

; function _asm_copy 

; copies blocks of memory 

; C prototype 

; void asm_copy( void * dest, const void * src, unsigned sz); 
; parameters: 

; dest - pointer to buffer to copy to 

; src - pointer to buffer to copy from 

; SZ - number of bytes to copy 


; next, some helpful symbols are defined 
define dest [ebp+8] 


%define src [ebp+12] 
define sz [ebp+16] 


_asm_copy: 
enter 0, O 
push esi 
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PS 
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push 


mov 
mov 
mov 


cld 
rep 


pop 
pop 
leave 
ret 


edi 

esi, src 
edi, dest 
ecx, SZ 


movsb 


edi 
esi 


function _asm_find 
searches memory for a given byte 
void * asm_find( const void * src, char target, unsigned sz); 
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; esi = address of buffer to copy from 
; edi = address of buffer to copy to 
; ecx = number of bytes to copy 


; clear direction flag 
; execute movsb ECX times 


parameters: 
src - pointer to buffer to search 
target - byte value to search for 
SZ - number of bytes in buffer 


return value: 


if target is found, pointer to first occurrence of target in buffer 


is returned 


else 


NULL is returned 
NOTE: target is a byte value, but is pushed on stack as a dword value. 
The byte value is stored in the lower 8-bits. 


%define src 
%define target [ebp+12] 
%define sz 


_asm_find: 


enter 
push 


mov 
mov 
mov 
cld 


Lebp+8] 


[ebp+16] 


0,0 
edi 


eax, target 


edi, src 
ecx, SZ 


; al has value to search for 
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repne scasb ; scan until ECX == 0 or [ES:EDI] == AL 
je found_it ; if zero flag set, then found value 
mov eax, 0 ; if not found, return NULL pointer 
jmp short quit 

found_it: 
mov eax, edi 
dec eax ; if found return (DI - 1) 

quit: 
pop edi 
leave 
ret 


$ 


function _asm_strlen 
returns the size of a string 
unsigned asm_strlen( const char * ); 
parameter: 
src - pointer to string 
return value: 
number of chars in string (not counting, ending 0) (in EAX) 


define src [ebp + 8] 
_asm_strlen: 


3 


> 


enter 0,0 


push edi 

mov edi, src ; edi = pointer to string 
mov ecx, OFFFFFFFFh ; use largest possible ECX 
xor al,al ; al = 0 

cld 

repnz scasb ; scan for terminating 0 


repnz will go one step too far, so length is FFFFFFFE - ECX, 
not FFFFFFFF - ECX 


mov eax, OFFFFFFFEh 
sub eax, ecx ; length = OFFFFFFFEh - ecx 
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pop edi 
leave 
ret 


; function _asm_strcpy 

; copies a string 

; void asm_strcpy( char * dest, const char * src); 
; parameters: 

; dest - pointer to string to copy to 

g src - pointer to string to copy from 


%define dest [ebp + 8] 
ń%define src [ebp + 12] 
_asm_strcpy: 

enter 0,0 


push esi 
push edi 
mov edi, dest 
mov esi, src 
cld 

cpy_loop: 
lodsb ; load AL & inc si 
stosb ; store AL & inc di 
or al, al ; set condition flags 
jnz cpy_loop ; if not past terminating 0, continue 
pop edi 
pop esi 
leave 
ret 


memory.asm 
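
The count arithmetic in _asm_strlen (length = FFFFFFFE - ECX) can be checked with a small C model. This is only a sketch and not part of memory.asm: the name model_repnz_strlen is hypothetical, and the loop merely imitates how REPNZ SCASB decrements ECX once for every byte it examines, including the terminating 0.

```c
#include <stdint.h>

/* Hypothetical C model of the REPNZ SCASB loop used by _asm_strlen. */
unsigned model_repnz_strlen(const char *s)
{
    uint32_t ecx = 0xFFFFFFFFu;   /* mov ecx, 0FFFFFFFFh */
    const char *edi = s;          /* mov edi, src        */

    /* REPNZ SCASB: while ECX != 0, decrement ECX, compare AL (0) with
       the byte at [EDI], advance EDI, and stop once the byte matched. */
    while (ecx != 0) {
        ecx--;
        if (*edi++ == 0)
            break;
    }

    /* ECX was also decremented for the terminating 0, hence FFFFFFFE. */
    return 0xFFFFFFFEu - ecx;
}
```

For the string "abc" the loop examines four bytes, so ECX ends at FFFFFFFB and the function returns FFFFFFFE - FFFFFFFB = 3.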


memex.c

#include <stdio.h>

#define STR_SIZE 30
/* prototypes */

void asm_copy( void *, const void *, unsigned ) __attribute__((cdecl));
void * asm_find( const void *,
                 char target, unsigned ) __attribute__((cdecl));
unsigned asm_strlen( const char * ) __attribute__((cdecl));
void asm_strcpy( char *, const char * ) __attribute__((cdecl));

int main()
{
  char st1[STR_SIZE] = "test string";
  char st2[STR_SIZE];
  char * st;
  char   ch;

  asm_copy(st2, st1, STR_SIZE);   /* copy all 30 chars of string */
  printf("%s\n", st2);

  printf("Enter a char: ");       /* look for byte in string */
  scanf("%c%*[^\n]", &ch);
  st = asm_find(st2, ch, STR_SIZE);
  if ( st )
    printf("Found it: %s\n", st);
  else
    printf("Not found\n");

  st1[0] = 0;
  printf("Enter string:");
  scanf("%s", st1);
  printf("len = %u\n", asm_strlen(st1));

  asm_strcpy( st2, st1);          /* copy meaningful data in string */
  printf("%s\n", st2 );

  return 0;
}

memex.c
Chapter 6


Floating Point 


6.1 Floating Point Representation 


6.1.1 Non-integral binary numbers 


When number systems were discussed in the first chapter, only integer 
values were discussed. Obviously, it must be possible to represent non- 
integral numbers in other bases as well as decimal. In decimal, digits to the 
right of the decimal point have associated negative powers of ten: 


$0.123 = 1 \times 10^{-1} + 2 \times 10^{-2} + 3 \times 10^{-3}$

Not surprisingly, binary numbers work similarly:

$0.101_2 = 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} = 0.625$

This idea can be combined with the integer methods of Chapter 1 to convert
a general number:

$110.011_2 = 4 + 2 + 0.25 + 0.125 = 6.375$


Converting from decimal to binary is not very difficult either. In general, 
divide the decimal number into two parts: integer and fraction. Convert the 
integer part to binary using the methods from Chapter 1. The fractional 
part is converted using the method described below. 

Consider a binary fraction with the bits labeled a, b,c,... The number 
in binary then looks like: 

0.abcdef ...


Multiply the number by two. The binary representation of the new number 
will be: 
a.bcdef ... 




0.5625 × 2 = 1.125    first bit  = 1
0.125  × 2 = 0.25     second bit = 0
0.25   × 2 = 0.5      third bit  = 0
0.5    × 2 = 1.0      fourth bit = 1

Figure 6.1: Converting 0.5625 to binary


0.85 × 2 = 1.7
0.7  × 2 = 1.4
0.4  × 2 = 0.8
0.8  × 2 = 1.6
0.6  × 2 = 1.2
0.2  × 2 = 0.4
0.4  × 2 = 0.8
0.8  × 2 = 1.6

Figure 6.2: Converting 0.85 to binary


Note that the first bit is now in the one's place. Replace the a with 0 to get:

0.bcdef ...

and multiply by two again to get:

b.cdef ...

Now the second bit (b) is in the one's position. This procedure can be
repeated until as many bits as are needed have been found. Figure 6.1 shows
a real example that converts 0.5625 to binary. The method stops when a
fractional part of zero is reached.
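
The doubling procedure above can be sketched in C. This is an illustration rather than code from the book: the function name frac_to_binary and the max_bits limit are assumptions, and the limit is needed because some fractions never reach zero.

```c
#include <stddef.h>

/* Writes up to max_bits binary digits of frac (0 <= frac < 1) into out,
   followed by a terminating 0, and returns the number of digits written.
   out must have room for max_bits + 1 characters. */
size_t frac_to_binary(double frac, char *out, size_t max_bits)
{
    size_t n = 0;

    while (frac != 0.0 && n < max_bits) {
        frac *= 2.0;              /* move the next bit into the one's place */
        if (frac >= 1.0) {
            out[n++] = '1';
            frac -= 1.0;          /* replace that bit with 0 */
        } else {
            out[n++] = '0';
        }
    }
    out[n] = '\0';
    return n;
}
```

Called with 0.5625 the function stops after four digits ("1001"), which matches Figure 6.1. Called with 0.85 it keeps producing digits until max_bits is reached.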

As another example, consider converting 23.85 to binary. It is easy to
convert the integral part ($23 = 10111_2$), but what about the fractional
part (0.85)? Figure 6.2 shows the beginning of this calculation. If one looks
at the numbers carefully, an infinite loop is found! This means that 0.85 is
a repeating binary (as opposed to a repeating decimal in base 10) [1]. There
is a pattern to the numbers in the calculation. Looking at the pattern, one
can see that $0.85 = 0.11\overline{0110}_2$. Thus,
$23.85 = 10111.11\overline{0110}_2$.

One important consequence of the above calculation is that 23.85 can
not be represented exactly in binary using a finite number of bits. (Just
as 1/3 can not be represented in decimal with a finite number of digits.) As
this chapter shows, float and double variables in C are stored in binary.
Thus, values like 23.85 can not be stored exactly into these variables. Only
an approximation of 23.85 can be stored.

To simplify the hardware, floating point numbers are stored in a consistent
format. This format uses scientific notation (but in binary, using
powers of two, not ten). For example, 23.85 or $10111.11011001100110\ldots_2$
would be stored as:

$1.011111011001100110\ldots \times 2^{100}$

(where the exponent (100) is in binary). A normalized floating point number
has the form:

$1.ssssssssssssssss \times 2^{eeeeeee}$

where 1.sssssssssssss is the significand and eeeeeee is the exponent.


6.1.2 IEEE floating point representation 


The IEEE (Institute of Electrical and Electronics Engineers) is an
international organization that has designed specific binary formats for
storing floating point numbers. These formats are used on most (but not all!)
computers made today. Often they are supported by the hardware of the
computer itself. For example, Intel's numeric (or math) coprocessors (which
are built into all its CPUs since the Pentium) use them. The IEEE defines two
different formats with different precisions: single and double precision.
Single precision is used by float variables in C and double precision is used
by double variables.

Intel's math coprocessor also uses a third, higher precision called
extended precision. In fact, all data in the coprocessor itself is in this
precision. When it is stored in memory from the coprocessor it is converted
to either single or double precision automatically [2]. Extended precision
uses a slightly different general format than the IEEE float and double
formats and so will not be discussed here.

[1] It should not be so surprising that a number might repeat in one base,
but not another. Think about 1/3: it repeats in decimal, but in ternary
(base 3) it would be $0.1_3$.

[2] Some compilers' (such as Borland's) long double type uses this extended
precision. However, other compilers use double precision for both double and
long double. (This is allowed by ANSI C.)

One should always keep in mind that the bytes 41 BE CC CD can be interpreted
in different ways depending on what a program does with them! As a single
precision floating point number, they represent 23.850000381, but as a double
word integer, they represent 1,103,023,309! The CPU does not know which is
the correct interpretation!
 31   30        23   22                      0
+---+-------------+-------------------------+
| s |      e      |            f            |
+---+-------------+-------------------------+

s   sign bit - 0 = positive, 1 = negative
e   biased exponent (8 bits) = true exponent + 7F (127 decimal). The
    values 00 and FF have special meaning (see text).
f   fraction - the first 23 bits after the 1. in the significand.

Figure 6.3: IEEE single precision


IEEE single precision 


Single precision floating point uses 32 bits to encode the number. It
is usually accurate to 7 significant decimal digits. Floating point numbers
are stored in a much more complicated format than integers. Figure 6.3
shows the basic format of an IEEE single precision number. There are several
quirks to the format. Floating point numbers do not use the two's
complement representation for negative numbers. They use a signed magnitude
representation. Bit 31 determines the sign of the number as shown.

The binary exponent is not stored directly. Instead, the sum of the
exponent and 7F is stored from bit 23 to 30. This biased exponent is always
non-negative.

The fraction part assumes a normalized significand (in the form 1.sssssssss).
Since the first bit is always a one, the leading one is not stored! This
allows the storage of an additional bit at the end and so increases the
precision slightly. This idea is known as the hidden one representation.

How would 23.85 be stored? First, it is positive so the sign bit is 0. Next
the true exponent is 4, so the biased exponent is $7F + 4 = 83_{16}$. Finally,
the fraction is 01111101100110011001100 (remember the leading one is hidden).
Putting this all together (to help clarify the different sections of the
floating point format, the sign bit and the fraction have been underlined and
the bits have been grouped into 4-bit nibbles):

$\underline{0}\,100\ 0001\ 1\,\underline{011\ 1110\ 1100\ 1100\ 1100\ 1100}_2 = \text{41BECCCC}_{16}$

This is not exactly 23.85 (since it is a repeating binary). If one converts
the above back to decimal, one finds that it is approximately 23.849998474.
This number is very close to 23.85, but it is not exact. Actually, in C, 23.85
would not be represented exactly as above. Since the left-most bit that was
truncated from the exact representation is 1, the last bit is rounded up to 1.
So 23.85 would be represented as 41 BE CC CD in hex using single precision.
Converting this to decimal results in 23.850000381 which is a slightly better
approximation of 23.85.
e = 0  and f = 0    denotes the number zero (which can not be normalized).
                    Note that there is a +0 and -0.
e = 0  and f ≠ 0    denotes a denormalized number. These are discussed in
                    the next section.
e = FF and f = 0    denotes infinity (∞). There are both positive and
                    negative infinities.
e = FF and f ≠ 0    denotes an undefined result, known as NaN
                    (Not a Number).

Table 6.1: Special values of f and e


 63   62        52   51                      0
+---+-------------+-------------------------+
| s |      e      |            f            |
+---+-------------+-------------------------+

Figure 6.4: IEEE double precision


How would -23.85 be represented? Just change the sign bit: C1 BE CC 
CD. Do not take the two’s complement! 
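
The byte patterns worked out above can be checked from C by copying the bytes of a float into an integer. The following is a minimal sketch, assuming that float is the IEEE single precision format and that the compiler rounds the constant 23.85 to the nearest float. The helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw 32-bit pattern of a float. memcpy copies the bytes
   without performing any numeric conversion. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* The three fields of Figure 6.3. */
unsigned float_sign(float x)       { return float_bits(x) >> 31; }
unsigned float_biased_exp(float x) { return (float_bits(x) >> 23) & 0xFFu; }
uint32_t float_fraction(float x)   { return float_bits(x) & 0x7FFFFFu; }
```

Under those assumptions float_bits(23.85f) is 41BECCCD and float_bits(-23.85f) is C1BECCCD: only the sign bit differs.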


Certain combinations of e and f have special meanings for IEEE floats. 
Table 6.1 describes these special values. An infinity is produced by an 
overflow or by division by zero. An undefined result is produced by an 
invalid operation such as trying to find the square root of a negative number, 
adding two infinities, etc. 
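
The rules of Table 6.1 can be written as a short C function that looks only at the e and f fields of a 32-bit pattern. This is a sketch for illustration. The type fp_kind and the function classify_single are hypothetical names, not part of any library.

```c
#include <stdint.h>

typedef enum {
    FP_KIND_ZERO,
    FP_KIND_DENORMAL,
    FP_KIND_NORMAL,
    FP_KIND_INFINITY,
    FP_KIND_NAN
} fp_kind;

/* Classifies a single precision bit pattern using Table 6.1. */
fp_kind classify_single(uint32_t bits)
{
    uint32_t e = (bits >> 23) & 0xFFu;    /* biased exponent, bits 23-30 */
    uint32_t f = bits & 0x7FFFFFu;        /* fraction, bits 0-22         */

    if (e == 0)
        return f == 0 ? FP_KIND_ZERO : FP_KIND_DENORMAL;
    if (e == 0xFF)
        return f == 0 ? FP_KIND_INFINITY : FP_KIND_NAN;
    return FP_KIND_NORMAL;
}
```

The sign bit is ignored on purpose, so 00000000 and 80000000 are both reported as zero and 7F800000 and FF800000 are both reported as infinity.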


Normalized single precision numbers can range in magnitude from
$1.0 \times 2^{-126}$ ($\approx 1.1755 \times 10^{-38}$) to
$1.11111\ldots \times 2^{127}$ ($\approx 3.4028 \times 10^{38}$).


Denormalized numbers


Denormalized numbers can be used to represent numbers with magnitudes
too small to normalize (i.e. below $1.0 \times 2^{-126}$). For example,
consider the number $1.001_2 \times 2^{-129}$
($\approx 1.6530 \times 10^{-39}$). In the given normalized form, the
exponent is too small. However, it can be represented in the unnormalized
form: $0.01001_2 \times 2^{-127}$. To store this number, the biased exponent
is set to 0 (see Table 6.1) and the fraction is the complete significand of
the number written as a product with $2^{-127}$ (i.e. all bits are stored
including the one to the left of the binary point). The representation of
$1.001_2 \times 2^{-129}$ is then:


0 000 0000 0 001 0010 0000 0000 0000 0000
IEEE double precision


IEEE double precision uses 64 bits to represent numbers and is usually
accurate to about 15 significant decimal digits. As Figure 6.4 shows, the
basic format is very similar to single precision. More bits are used for the
biased exponent (11) and the fraction (52) than for single precision.

The larger range for the biased exponent has two consequences. The first
is that it is calculated as the sum of the true exponent and 3FF (1023) (not
7F as for single precision). Secondly, a larger range of true exponents (and
thus a larger range of magnitudes) is allowed. Double precision magnitudes
can range from approximately $10^{-308}$ to $10^{308}$.

It is the larger field of the fraction that is responsible for the increase in
the number of significant digits for double values.

As an example, consider 23.85 again. The biased exponent will be 4 +
3FF = 403 in hex. Thus, the double representation would be:


0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010


or 40 37 D9 99 99 99 99 9A in hex. If one converts this back to decimal,
one finds 23.8500000000000014 (there are 12 zeros!) which is a much better
approximation of 23.85.

The double precision has the same special values as single precision [3].
Denormalized numbers are also very similar. The only main difference is
that double denormalized numbers use $2^{-1023}$ instead of $2^{-127}$.
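
The double precision pattern for 23.85 can be inspected in the same way as a float. This is a sketch, assuming that double is the IEEE double precision format. The helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw 64-bit pattern of a double. */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* The 11-bit biased exponent occupies bits 52 to 62 (Figure 6.4). */
unsigned double_biased_exp(double x)
{
    return (unsigned)((double_bits(x) >> 52) & 0x7FFu);
}
```

With those assumptions double_bits(23.85) is 4037D9999999999A and double_biased_exp(23.85) is 403 in hex, as computed above.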


6.2 Floating Point Arithmetic 


Floating point arithmetic on a computer is different than in continuous 
mathematics. In mathematics, all numbers can be considered exact. As 
shown in the previous section, on a computer many numbers can not be 
represented exactly with a finite number of bits. All calculations are per- 
formed with limited precision. In the examples of this section, numbers with 
an 8-bit significand will be used for simplicity. 


6.2.1 Addition 


To add two floating point numbers, the exponents must be equal. If
they are not already equal, then they must be made equal by shifting the
significand of the number with the smaller exponent. For example, consider
10.375 + 6.34375 = 16.71875 or in binary:

$$
\begin{array}{rl}
  & 1.0100110 \times 2^3 \\
+ & 1.1001011 \times 2^2
\end{array}
$$

[3] The only difference is that for the infinity and undefined values, the
biased exponent is 7FF not FF.

These two numbers do not have the same exponent so shift the significand
to make the exponents the same and then add:

$$
\begin{array}{rl}
  & 1.0100110 \times 2^3 \\
+ & 0.1100110 \times 2^3 \\ \hline
  & 10.0001100 \times 2^3
\end{array}
$$

Note that the shifting of $1.1001011 \times 2^2$ drops off the trailing one
and after rounding results in $0.1100110 \times 2^3$. The result of the
addition, $10.0001100 \times 2^3$ (or $1.00001100 \times 2^4$), is equal to
$10000.110_2$ or 16.75. This is not equal to the exact answer (16.71875)! It
is only an approximation due to the round off errors of the addition process.

It is important to realize that floating point arithmetic on a computer 
(or calculator) is always an approximation. The laws of mathematics do 
not always work with floating point numbers on a computer. Mathemat- 
ics assumes infinite precision which no computer can match. For example, 
mathematics teaches that (a + b) — b = a; however, this may not hold true 
exactly on a computer! 
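
The failure of (a + b) - b = a can be reproduced in C with single precision values. This is a sketch with a hypothetical function name. The volatile temporaries are an assumption made here to force each intermediate result to be rounded to a float, whatever precision the compiler uses internally.

```c
/* Computes (a + b) - b with every intermediate rounded to a float. */
float add_then_subtract(float a, float b)
{
    volatile float sum = a + b;      /* rounded to 24 significant bits */
    volatile float diff = sum - b;
    return diff;
}
```

For a = 1.0 and b = 100000000.0 the sum 100000001 needs more significant bits than a float has, so it is rounded back to 100000000 and the function returns 0 instead of 1. For small values such as a = 1.0 and b = 2.0 the result is exact.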


6.2.2 Subtraction 


Subtraction works very similarly and has the same problems as addition. 
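As an example, consider 16.75 - 15.9375 = 0.8125:

$$
\begin{array}{rl}
  & 1.0000110 \times 2^4 \\
- & 1.1111111 \times 2^3
\end{array}
$$

Shifting $1.1111111 \times 2^3$ gives (rounding up) $1.0000000 \times 2^4$:

$$
\begin{array}{rl}
  & 1.0000110 \times 2^4 \\
- & 1.0000000 \times 2^4 \\ \hline
  & 0.0000110 \times 2^4
\end{array}
$$

$0.0000110 \times 2^4 = 0.11_2 = 0.75$, which is not exactly correct.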


6.2.3 Multiplication and division 


For multiplication, the significands are multiplied and the exponents are
added. Consider 10.375 × 2.5 = 25.9375:

$$
\begin{array}{rl}
       & 1.0100110 \times 2^3 \\
\times & 1.0100000 \times 2^1 \\ \hline
       & \phantom{00}10100110 \\
+      & 10100110\phantom{00} \\ \hline
       & 1.10011111000000 \times 2^4
\end{array}
$$

Of course, the real result would be rounded to 8 bits to give:

$1.1010000 \times 2^4 = 11010.000_2 = 26$

Division is more complicated, but has similar problems with round off
errors.


6.2.4 Ramifications for programming 


The main point of this section is that floating point calculations are not
exact. The programmer needs to be aware of this. A common mistake
that programmers make with floating point numbers is to compare them
assuming that a calculation is exact. For example, consider a function named
f(x) that makes a complex calculation and a program that is trying to find
the function's roots [4]. One might be tempted to use the following statement
to check to see if x is a root:


if ( f(x) == 0.0 )


But, what if f(x) returns $1 \times 10^{-30}$? This very likely means that x
is a very good approximation of a true root; however, the equality will be
false. There may not be any IEEE floating point value of x that returns
exactly zero, due to round off errors in f(x).

A much better method would be to use:


if ( fabs(f(x)) < EPS )


where EPS is a macro defined to be a very small positive value (like
$1 \times 10^{-10}$). This is true whenever f(x) is very close to zero. In
general, to compare a floating point value (say x) to another (y) use:


if ( fabs(x - y)/fabs(y) < EPS )
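
The relative comparison can be wrapped in a small helper. The following is a sketch only: the name nearly_equal is hypothetical, and the fallback to an absolute test when y is zero is an assumption added here, because the relative form would otherwise divide by zero.

```c
#include <math.h>

/* Returns nonzero if x is within a relative tolerance eps of y.
   When y is exactly zero an absolute test is used instead (an added
   assumption, since fabs(x - y)/fabs(y) is undefined for y == 0). */
int nearly_equal(double x, double y, double eps)
{
    if (y == 0.0)
        return fabs(x) < eps;
    return fabs(x - y) / fabs(y) < eps;
}
```

A typical use is the sum 0.1 + 0.2, which is not exactly equal to 0.3 in double precision but does satisfy nearly_equal(0.1 + 0.2, 0.3, 1e-10). The right value for eps depends on the calculation, so 1e-10 is only an example.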


6.3 The Numeric Coprocessor 


6.3.1 Hardware 


The earliest Intel processors had no hardware support for floating point
operations. This does not mean that they could not perform float operations.
It just means that they had to be performed by procedures composed of
many non-floating point instructions. For these early systems, Intel did
provide an additional chip called a math coprocessor. A math coprocessor
has machine instructions that perform many floating point operations much
faster than using a software procedure (on early processors, at least 10
times faster!). The coprocessor for the 8086/8088 was called the 8087. For
the 80286, there was an 80287 and for the 80386, an 80387. The 80486DX
processor integrated the math coprocessor into the 80486 itself [5]. Since
the Pentium, all generations of 80x86 processors have a built-in math
coprocessor; however, it is still programmed as if it was a separate unit.
Even earlier systems without a coprocessor can install software that emulates
a math coprocessor. These emulator packages are automatically activated when
a program executes a coprocessor instruction and run a software procedure
that produces the same result as the coprocessor would have (though much
slower, of course).

[4] A root of a function is a value x such that f(x) = 0.

The numeric coprocessor has eight floating point registers. Each register
holds 80 bits of data. Floating point numbers are always stored as 80-bit
extended precision numbers in these registers. The registers are named ST0,
ST1, ST2, ... ST7. The floating point registers are used differently than the
integer registers of the main CPU. The floating point registers are organized
as a stack. Recall that a stack is a Last-In First-Out (LIFO) list. ST0 always
refers to the value at the top of the stack. All new numbers are added to the
top of the stack. Existing numbers are pushed down on the stack to make
room for the new number.

There is also a status register in the numeric coprocessor. It has several
flags. Only the 4 flags used for comparisons will be covered: C0, C1, C2 and
C3. The use of these is discussed later.
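
The way ST0 to ST7 are renamed as values are pushed and popped can be illustrated with a small C model. This is a simplified sketch, not a description of the full hardware: the names fpu_model, fpu_push, fpu_pop and fpu_st are hypothetical, the registers are modelled as double rather than 80-bit values, and empty-register tracking and stack overflow are left out.

```c
#define FPU_REGS 8

typedef struct {
    double   reg[FPU_REGS];  /* the eight physical registers           */
    unsigned top;            /* index of the register currently ST0    */
} fpu_model;

void fpu_init(fpu_model *f)
{
    f->top = 0;
}

/* ST(i) is always located relative to the current top of the stack. */
double *fpu_st(fpu_model *f, unsigned i)
{
    return &f->reg[(f->top + i) % FPU_REGS];
}

/* Push: the old ST0 becomes ST1, the old ST1 becomes ST2, and so on. */
void fpu_push(fpu_model *f, double value)
{
    f->top = (f->top + FPU_REGS - 1) % FPU_REGS;
    f->reg[f->top] = value;
}

/* Pop: returns ST0, after which the old ST1 becomes the new ST0. */
double fpu_pop(fpu_model *f)
{
    double value = f->reg[f->top];
    f->top = (f->top + 1) % FPU_REGS;
    return value;
}
```

After pushing 1.0, 2.0 and 3.0 in that order, ST0 holds 3.0, ST1 holds 2.0 and ST2 holds 1.0. No data is copied between registers on a push or pop. Only the top index changes.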


6.3.2 Instructions 
To make it easy to distinguish the normal CPU instructions from copro- 
cessor ones, all the coprocessor mnemonics start with an F. 


Loading and storing 


There are several instructions that load data onto the top of the
coprocessor register stack:

FLD source    loads a floating point number from memory onto the top of
              the stack. The source may be a single, double or extended
              precision number or a coprocessor register.
FILD source   reads an integer from memory, converts it to floating point
              and stores the result on top of the stack. The source may be
              either a word, double word or quad word.
FLD1          stores a one on the top of the stack.
FLDZ          stores a zero on the top of the stack.

There are also several instructions that store data from the stack into
memory. Some of these instructions also pop (i.e. remove) the number from
the stack as they store it.

FST dest      stores the top of the stack (ST0) into memory. The
              destination may either be a single or double precision number
              or a coprocessor register.
FSTP dest     stores the top of the stack into memory just as FST; however,
              after the number is stored, its value is popped from the
              stack. The destination may either be a single, double or
              extended precision number or a coprocessor register.
FIST dest     stores the value of the top of the stack converted to an
              integer into memory. The destination may either be a word or
              a double word. The stack itself is unchanged. How the
              floating point number is converted to an integer depends on
              some bits in the coprocessor's control word. This is a
              special (non-floating point) word register that controls how
              the coprocessor works. By default, the control word is
              initialized so that it rounds to the nearest integer when it
              converts to integer. However, the FSTCW (Store Control Word)
              and FLDCW (Load Control Word) instructions can be used to
              change this behavior.
FISTP dest    same as FIST except for two things. The top of the stack is
              popped and the destination may also be a quad word.

[5] However, the 80486SX did not have an integrated coprocessor. There was a
separate 80487SX chip for these machines.

There are two other instructions that can move or remove data on the
stack itself.

FXCH STn      exchanges the values in ST0 and STn on the stack (where n
              is a register number from 1 to 7).
FFREE STn     frees up a register on the stack by marking the register as
              unused or empty.


Addition and subtraction 


Each of the addition instructions computes the sum of ST0 and another
operand. The result is always stored in a coprocessor register.

FADD src            ST0 += src. The src may be any coprocessor register
                    or a single or double precision number in memory.
FADD dest, ST0      dest += ST0. The dest may be any coprocessor
                    register.
FADDP dest or       dest += ST0 then pop stack. The dest may be any
FADDP dest, ST0     coprocessor register.
FIADD src           ST0 += (float) src. Adds an integer to ST0. The
                    src must be a word or double word in memory.


There are twice as many subtraction instructions as addition instructions
because the order of the operands is important for subtraction (i.e.
a + b = b + a, but a - b ≠ b - a!). For each instruction, there is an
alternate one that subtracts in the reverse order. These reverse instructions
all end in either R or RP. Figure 6.5 shows a short code snippet that adds up
the elements of an array of doubles. On lines 10 and 13, one must specify the
size of the memory operand. Otherwise the assembler would not know whether
the memory operand was a float (dword) or a double (qword).

 1  segment .bss
 2  array   resq    SIZE
 3  sum     resq    1
 4
 5  segment .text
 6          mov     ecx, SIZE
 7          mov     esi, array
 8          fldz                    ; ST0 = 0
 9  lp:
10          fadd    qword [esi]     ; ST0 += *(esi)
11          add     esi, 8          ; move to next double
12          loop    lp
13          fstp    qword [sum]     ; store result into sum

Figure 6.5: Array sum example
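
For comparison, the loop of Figure 6.5 corresponds roughly to the following C function. This is a sketch with a hypothetical name, and it is not an exact equivalent: the coprocessor keeps the running sum in ST0 as an 80-bit extended precision value, while a C compiler may keep it as a 64-bit double, so the last bits of the result can differ.

```c
#include <stddef.h>

/* Adds up the elements of an array of doubles. */
double sum_doubles(const double *array, size_t size)
{
    double sum = 0.0;                    /* fldz                    */
    size_t i;

    for (i = 0; i < size; i++)           /* mov ecx, SIZE / loop lp */
        sum += array[i];                 /* fadd qword [esi]        */
    return sum;                          /* fstp stores the result  */
}
```

Unlike the assembly loop, the C version also behaves sensibly when size is zero. The LOOP instruction decrements ECX before testing it, so the assembly version must not be entered with ECX equal to 0.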


FSUB src            ST0 -= src. The src may be any coprocessor register
                    or a single or double precision number in memory.
FSUBR src           ST0 = src - ST0. The src may be any coprocessor
                    register or a single or double precision number in
                    memory.
FSUB dest, ST0      dest -= ST0. The dest may be any coprocessor
                    register.
FSUBR dest, ST0     dest = ST0 - dest. The dest may be any coprocessor
                    register.
FSUBP dest or       dest -= ST0 then pop stack. The dest may be any
FSUBP dest, ST0     coprocessor register.
FSUBRP dest or      dest = ST0 - dest then pop stack. The dest may
FSUBRP dest, ST0    be any coprocessor register.
FISUB src           ST0 -= (float) src. Subtracts an integer from
                    ST0. The src must be a word or double word in memory.
FISUBR src          ST0 = (float) src - ST0. Subtracts ST0 from an
                    integer. The src must be a word or double word in
                    memory.




Multiplication and division 


The multiplication instructions are completely analogous to the addition
instructions.

FMUL src            ST0 *= src. The src may be any coprocessor register
                    or a single or double precision number in memory.
FMUL dest, ST0      dest *= ST0. The dest may be any coprocessor
                    register.
FMULP dest or       dest *= ST0 then pop stack. The dest may be any
FMULP dest, ST0     coprocessor register.
FIMUL src           ST0 *= (float) src. Multiplies ST0 by an integer.
                    The src must be a word or double word in memory.


Not surprisingly, the division instructions are analogous to the
subtraction instructions. Division by zero results in an infinity.

FDIV src            ST0 /= src. The src may be any coprocessor register
                    or a single or double precision number in memory.
FDIVR src           ST0 = src / ST0. The src may be any coprocessor
                    register or a single or double precision number in
                    memory.
FDIV dest, ST0      dest /= ST0. The dest may be any coprocessor
                    register.
FDIVR dest, ST0     dest = ST0 / dest. The dest may be any coprocessor
                    register.
FDIVP dest or       dest /= ST0 then pop stack. The dest may be any
FDIVP dest, ST0     coprocessor register.
FDIVRP dest or      dest = ST0 / dest then pop stack. The dest may
FDIVRP dest, ST0    be any coprocessor register.
FIDIV src           ST0 /= (float) src. Divides ST0 by an integer.
                    The src must be a word or double word in memory.
FIDIVR src          ST0 = (float) src / ST0. Divides an integer by
                    ST0. The src must be a word or double word in memory.


Comparisons


The coprocessor also performs comparisons of floating point numbers.
The FCOM family of instructions does this operation.

FCOM src      compares ST0 and src. The src can be a coprocessor register
              or a float or double in memory.
FCOMP src     compares ST0 and src, then pops stack. The src can be a
              coprocessor register or a float or double in memory.
FCOMPP        compares ST0 and ST1, then pops stack twice.
FICOM src     compares ST0 and (float) src. The src can be a word or
              dword integer in memory.
FICOMP src    compares ST0 and (float) src, then pops stack. The src
              can be a word or dword integer in memory.
FTST          compares ST0 and 0.

 1  ;     if ( x > y )
 2  ;
 3          fld     qword [x]       ; ST0 = x
 4          fcomp   qword [y]       ; compare ST0 and y
 5          fstsw   ax              ; move C bits into FLAGS
 6          sahf
 7          jna     else_part       ; if x not above y, goto else_part
 8  then_part:
 9          ; code for then part
10          jmp     end_if
11  else_part:
12          ; code for else part
13  end_if:

Figure 6.6: Comparison example


The comparison instructions change the C0, C1, C2 and C3 bits of the
coprocessor status register. Unfortunately, it is not possible for the CPU to
access these bits directly. The conditional branch instructions use the FLAGS
register, not the coprocessor status register. However, it is relatively
simple to transfer the bits of the status word into the corresponding bits of
the FLAGS register using some new instructions:

FSTSW dest    Stores the coprocessor status word into either a word in
              memory or the AX register.
SAHF          Stores the AH register into the FLAGS register.
LAHF          Loads the AH register with the bits of the FLAGS register.

Figure 6.6 shows a short example code snippet. Lines 5 and 6 transfer
the C0, C1, C2 and C3 bits of the coprocessor status word into the FLAGS
register. The bits are transferred so that they are analogous to the result
of a comparison of two unsigned integers. This is why line 7 uses a JNA
instruction.

The Pentium Pro and later processors (Pentium II and III) support two
new comparison instructions that directly modify the CPU's FLAGS register.

FCOMI src     compares ST0 and src. The src must be a coprocessor
              register.
FCOMIP src    compares ST0 and src, then pops stack. The src must be a
              coprocessor register.

Figure 6.7 shows an example subroutine that finds the maximum of two
doubles using the FCOMIP instruction. Do not confuse these instructions with
the integer comparison instructions (FICOM and FICOMP).
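
Why an unsigned-style branch such as JNA works after FSTSW and SAHF can be seen from where the condition bits land. The C model below is a sketch with hypothetical function names. It relies on the bit positions of the x87 status word (C0 is bit 8, C2 is bit 10, C3 is bit 14) and on SAHF copying AH into the low byte of FLAGS, so that C0 lands on the carry flag, C2 on the parity flag and C3 on the zero flag.

```c
#include <stdint.h>

#define SW_C0   0x0100u   /* status word bit 8  */
#define SW_C2   0x0400u   /* status word bit 10 */
#define SW_C3   0x4000u   /* status word bit 14 */

#define FLAG_CF 0x01u     /* FLAGS bit 0: carry  */
#define FLAG_PF 0x04u     /* FLAGS bit 2: parity */
#define FLAG_ZF 0x40u     /* FLAGS bit 6: zero   */

/* Condition bits left in the status word by comparing ST0 with src. */
uint16_t model_fcom(double st0, double src)
{
    if (st0 > src)
        return 0;                       /* C3 = C2 = C0 = 0 */
    if (st0 < src)
        return SW_C0;                   /* C0 = 1           */
    if (st0 == src)
        return SW_C3;                   /* C3 = 1           */
    return SW_C3 | SW_C2 | SW_C0;       /* unordered (NaN)  */
}

/* FSTSW AX then SAHF: the high byte of the status word becomes the
   low byte of FLAGS. */
uint8_t model_fstsw_sahf(uint16_t status_word)
{
    return (uint8_t)(status_word >> 8);
}

/* The condition tested by JA; JNA branches when this is false. */
int cond_above(uint8_t flags)
{
    return (flags & (FLAG_CF | FLAG_ZF)) == 0;
}
```

In this model "ST0 less than src" sets only the carry flag and "equal" sets only the zero flag, exactly as an unsigned integer CMP would. An unordered comparison (one operand is a NaN) sets carry, parity and zero together, so JNA is taken in that case as well.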


Miscellaneous instructions 


This section covers some other miscellaneous instructions that the
coprocessor provides.

FCHS      ST0 = -ST0. Changes the sign of ST0.
FABS      ST0 = |ST0|. Takes the absolute value of ST0.
FSQRT     $ST0 = \sqrt{ST0}$. Takes the square root of ST0.
FSCALE    $ST0 = ST0 \times 2^{\lfloor ST1 \rfloor}$. Multiplies ST0 by a
          power of 2 quickly. ST1 is not removed from the coprocessor
          stack. Figure 6.8 shows an example of how to use this
          instruction.


6.3.3 Examples


6.3.4 Quadratic formula


The first example shows how the quadratic formula can be encoded in
assembly. Recall that the quadratic formula computes the solutions to the
quadratic equation:

$ax^2 + bx + c = 0$

The formula itself gives two solutions for x: $x_1$ and $x_2$.

$x_1, x_2 = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

The expression inside the square root ($b^2 - 4ac$) is called the
discriminant. Its value is useful in determining which of the following three
possibilities is true for the solutions.


1. There is only one real degenerate solution. $b^2 - 4ac = 0$
2. There are two real solutions. $b^2 - 4ac > 0$
3. There are two complex solutions. $b^2 - 4ac < 0$
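
The three cases can be sketched as a C function that looks only at the sign of the discriminant. The name count_real_roots is hypothetical. Note that the exact test against zero is subject to the round off problems described in Section 6.2.4, so for computed coefficients the degenerate case may be reported as 0 or 2 roots instead.

```c
/* Returns the number of distinct real solutions of a*x^2 + b*x + c = 0,
   assuming a is not zero. */
int count_real_roots(double a, double b, double c)
{
    double disc = b * b - 4.0 * a * c;   /* the discriminant */

    if (disc < 0.0)
        return 0;        /* two complex solutions        */
    if (disc == 0.0)
        return 1;        /* one real degenerate solution */
    return 2;            /* two real solutions           */
}
```

For example, $x^2 - 3x + 2$ has two real solutions, $x^2 + 2x + 1$ has one and $x^2 + 1$ has none.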


Here is a small C program that uses the assembly subroutine:

quadt.c

#include <stdio.h>

int quadratic( double, double, double, double *, double * );

int main()
{
  double a, b, c, root1, root2;

  printf("Enter a, b, c: ");
  scanf("%lf %lf %lf", &a, &b, &c);
  if (quadratic( a, b, c, &root1, &root2 ) )
    printf("roots: %.10g %.10g\n", root1, root2);
  else
    printf("No real roots\n");
  return 0;
}

quadt.c


Here is the assembly routine:

quad.asm

; function quadratic
; finds solutions to the quadratic equation:
;       a*x^2 + b*x + c = 0
; C prototype:
;   int quadratic( double a, double b, double c,
;                  double * root1, double * root2 )
; Parameters:
;   a, b, c - coefficients of powers of quadratic equation (see above)
;   root1   - pointer to double to store first root in
;   root2   - pointer to double to store second root in
; Return value:
;   returns 1 if real roots found, else 0

%define a               qword [ebp+8]
%define b               qword [ebp+16]
%define c               qword [ebp+24]
%define root1           dword [ebp+32]
%define root2           dword [ebp+36]
%define disc            qword [ebp-8]
%define one_over_2a     qword [ebp-16]

segment .data
MinusFour       dw      -4

segment .text
        global  _quadratic
_quadratic:
        push    ebp
        mov     ebp, esp
        sub     esp, 16           ; allocate 2 doubles (disc & one_over_2a)
        push    ebx               ; must save original ebx

        fild    word [MinusFour]  ; stack: -4
        fld     a                 ; stack: a, -4
        fld     c                 ; stack: c, a, -4
        fmulp   st1               ; stack: a*c, -4
        fmulp   st1               ; stack: -4*a*c
        fld     b
        fld     b                 ; stack: b, b, -4*a*c
        fmulp   st1               ; stack: b*b, -4*a*c
        faddp   st1               ; stack: b*b - 4*a*c
        ftst                      ; test with 0
        fstsw   ax
        sahf
        jb      no_real_solutions ; if disc < 0, no real solutions
        fsqrt                     ; stack: sqrt(b*b - 4*a*c)
        fstp    disc              ; store and pop stack
        fld1                      ; stack: 1.0
        fld     a                 ; stack: a, 1.0
        fscale                    ; stack: a * 2^(1.0) = 2*a, 1
        fdivp   st1               ; stack: 1/(2*a)
        fst     one_over_2a       ; stack: 1/(2*a)
        fld     b                 ; stack: b, 1/(2*a)
        fld     disc              ; stack: disc, b, 1/(2*a)
        fsubrp  st1               ; stack: disc - b, 1/(2*a)
        fmulp   st1               ; stack: (-b + disc)/(2*a)
        mov     ebx, root1
        fstp    qword [ebx]       ; store in *root1
        fld     b                 ; stack: b
        fld     disc              ; stack: disc, b
        fchs                      ; stack: -disc, b
        fsubrp  st1               ; stack: -disc - b
        fmul    one_over_2a       ; stack: (-b - disc)/(2*a)
        mov     ebx, root2
        fstp    qword [ebx]       ; store in *root2
        mov     eax, 1            ; return value is 1
        jmp     short quit

no_real_solutions:
        mov     eax, 0            ; return value is 0

quit:
        pop     ebx
        mov     esp, ebp
        pop     ebp
        ret

quad.asm


6.3.5 Reading array from file 


In this example, an assembly routine reads doubles from a file. Here is 
a short C test program: 


readt.c

/*
 * This program tests the 32-bit read_doubles() assembly procedure.
 * It reads the doubles from stdin. (Use redirection to read from file.)
 */
#include <stdio.h>
extern int read_doubles( FILE *, double *, int );
#define MAX 100

int main()
{
  int i, n;
  double a[MAX];

  n = read_doubles( stdin, a, MAX );

  for( i = 0; i < n; i++ )
    printf("%3d %g\n", i, a[i]);
  return 0;
}

readt.c
Here is the assembly routine:

read.asm

segment .data
format  db      "%lf", 0        ; format for fscanf()

segment .text
        global  _read_doubles
        extern  _fscanf

%define SIZEOF_DOUBLE   8
%define FP              dword [ebp + 8]
%define ARRAYP          dword [ebp + 12]
%define ARRAY_SIZE      dword [ebp + 16]
%define TEMP_DOUBLE     [ebp - 8]

;
; function _read_doubles
; C prototype:
;   int read_doubles( FILE * fp, double * arrayp, int array_size );
; This function reads doubles from a text file into an array, until
; EOF or array is full.
; Parameters:
;   fp         - FILE pointer to read from (must be open for input)
;   arrayp     - pointer to double array to read into
;   array_size - number of elements in array
; Return value:
;   number of doubles stored into array (in EAX)

_read_doubles:
        push    ebp
        mov     ebp, esp
        sub     esp, SIZEOF_DOUBLE      ; define one double on stack

        push    esi                     ; save esi
        mov     esi, ARRAYP             ; esi = ARRAYP
        xor     edx, edx                ; edx = array index (initially 0)

while_loop:
        cmp     edx, ARRAY_SIZE         ; is edx < ARRAY_SIZE?
        jnl     short quit              ; if not, quit loop
;
; call fscanf() to read a double into TEMP_DOUBLE
; fscanf() might change edx so save it
;
        push    edx                     ; save edx
        lea     eax, TEMP_DOUBLE
        push    eax                     ; push &TEMP_DOUBLE
        push    dword format            ; push &format
        push    FP                      ; push file pointer
        call    _fscanf
        add     esp, 12
        pop     edx                     ; restore edx
        cmp     eax, 1                  ; did fscanf return 1?
        jne     short quit              ; if not, quit loop

;
; copy TEMP_DOUBLE into ARRAYP[edx]
; (The 8-bytes of the double are copied by two 4-byte copies)
;
        mov     eax, [ebp - 8]
        mov     [esi + 8*edx], eax      ; first copy lowest 4 bytes
        mov     eax, [ebp - 4]
        mov     [esi + 8*edx + 4], eax  ; next copy highest 4 bytes

        inc     edx
        jmp     while_loop

quit:
        pop     esi                     ; restore esi

        mov     eax, edx                ; store return value into eax

        mov     esp, ebp
        pop     ebp
        ret

read.asm


6.3.6 Finding primes


This final example looks at finding prime numbers again. This
implementation is more efficient than the previous one. It stores the
primes it has found in an array and, to find new primes, divides only by
the primes it has already found instead of by every odd number.

One other difference is that it computes the square root of the guess for
the next prime to determine at what point it can stop searching for factors.
It alters the coprocessor control word so that when it stores the square root
as an integer, it truncates instead of rounding. This is controlled by bits
10 and 11 of the control word. These bits are called the RC (Rounding
Control) bits. If they are both 0 (the default), the coprocessor rounds to
the nearest integer when converting to integer. If they are both 1, the
coprocessor truncates integer conversions. Notice that the routine is
careful to save the original control word and restore it before it returns.

Here is the C driver program: 


fprime.c

  #include <stdio.h>
  #include <stdlib.h>
  /*
   * function find_primes
   * finds the indicated number of primes
   * Parameters:
   *   a - array to hold primes
   *   n - how many primes to find
   */
  extern void find_primes( int * a, unsigned n );

  int main()
  {
    int status;
    unsigned i;
    unsigned max;
    int * a;

    printf("How many primes do you wish to find? ");
    scanf("%u", &max);

    a = calloc(max, sizeof(int));

    if ( a ) {

      find_primes(a, max);

      /* print out the last 20 primes found */
      for( i = ( max > 20 ) ? max - 20 : 0; i < max; i++ )
        printf("%3u %d\n", i+1, a[i]);

      free(a);
      status = 0;
    }
    else {
      fprintf(stderr, "Can not create array of %u ints\n", max);
      status = 1;
    }

    return status;
  }

fprime.c


Here is the assembly routine:


prime2.asm

segment .text
        global  _find_primes
;
; function find_primes
; finds the indicated number of primes
; Parameters:
;   array  - array to hold primes
;   n_find - how many primes to find
; C Prototype:
;extern void find_primes( int * array, unsigned n_find )
;
%define array         ebp + 8
%define n_find        ebp + 12
%define n             ebp - 4           ; number of primes found so far
%define isqrt         ebp - 8           ; floor of sqrt of guess
%define orig_cntl_wd  ebp - 10          ; original control word
%define new_cntl_wd   ebp - 12          ; new control word

_find_primes:
        enter   12,0                    ; make room for local variables

        push    ebx                     ; save possible register variables
        push    esi

        fstcw   word [orig_cntl_wd]     ; get current control word
        mov     ax, [orig_cntl_wd]
        or      ax, 0C00h               ; set rounding bits to 11 (truncate)
        mov     [new_cntl_wd], ax
        fldcw   word [new_cntl_wd]

        mov     esi, [array]            ; esi points to array
        mov     dword [esi], 2          ; array[0] = 2
        mov     dword [esi + 4], 3      ; array[1] = 3
        mov     ebx, 5                  ; ebx = guess = 5
        mov     dword [n], 2            ; n = 2
;
; This outer loop finds a new prime each iteration, which it adds to the
; end of the array. Unlike the earlier prime finding program, this function
; does not determine primeness by dividing by all odd numbers. It only
; divides by the prime numbers that it has already found. (That's why they
; are stored in the array.)
;
while_limit:
        mov     eax, [n]
        cmp     eax, [n_find]           ; while ( n < n_find )
        jnb     short quit_limit

        mov     ecx, 1                  ; ecx is used as array index
        push    ebx                     ; store guess on stack
        fild    dword [esp]             ; load guess onto coprocessor stack
        pop     ebx                     ; get guess off stack
        fsqrt                           ; find sqrt(guess)
        fistp   dword [isqrt]           ; isqrt = floor(sqrt(guess))
;
; This inner loop divides guess (ebx) by earlier computed prime numbers
; until it finds a prime factor of guess (which means guess is not prime)
; or until the prime number to divide is greater than floor(sqrt(guess))
;
while_factor:
        mov     eax, dword [esi + 4*ecx]        ; eax = array[ecx]
        cmp     eax, [isqrt]                    ; while ( array[ecx] <= isqrt
        jnbe    short quit_factor_prime
        mov     eax, ebx
        xor     edx, edx
        div     dword [esi + 4*ecx]
        or      edx, edx                        ; && guess % array[ecx] != 0 )
        jz      short quit_factor_not_prime
        inc     ecx                             ; try next prime
        jmp     short while_factor

;
; found a new prime !
;
quit_factor_prime:
        mov     eax, [n]
        mov     dword [esi + 4*eax], ebx        ; add guess to end of array
        inc     eax
        mov     [n], eax                        ; inc n

quit_factor_not_prime:
        add     ebx, 2                          ; try next odd number
        jmp     short while_limit

quit_limit:

        fldcw   word [orig_cntl_wd]             ; restore control word
        pop     esi                             ; restore register variables
        pop     ebx

        leave
        ret

prime2.asm


global _dmax

segment .text
; function _dmax
; returns the larger of its two double arguments
; C prototype
; double dmax( double d1, double d2 )
; Parameters:
;   d1   - first double
;   d2   - second double
; Return value:
;   larger of d1 and d2 (in ST0)
%define d1   ebp+8
%define d2   ebp+16
_dmax:
        enter   0, 0

        fld     qword [d2]
        fld     qword [d1]          ; ST0 = d1, ST1 = d2
        fcomip  st1                 ; ST0 = d2
        jna     short d2_bigger
        fcomp   st0                 ; pop d2 from stack
        fld     qword [d1]          ; ST0 = d1
        jmp     short exit
d2_bigger:                          ; if d2 is max, nothing to do
exit:
        leave
        ret

Figure 6.7: FCOMIP example


segment .data
x       dq      2.75            ; converted to double format
five    dd      5               ; double word, to match the FILD below

segment .text
        fild    dword [five]    ; ST0 = 5
        fld     qword [x]       ; ST0 = 2.75, ST1 = 5
        fscale                  ; ST0 = 2.75 * 32, ST1 = 5


Figure 6.8: FSCALE example


Chapter 7


Structures and C++ 


7.1 Structures 


7.1.1 Introduction 


Structures are used in C to group together related data into a composite 
variable. This technique has several advantages: 


1. It clarifies the code by showing that the data defined in the structure 
are intimately related. 


2. It simplifies passing the data to functions. Instead of passing multiple 
variables separately, they can be passed as a single unit. 


3. It increases the locality¹ of the code.


From the assembly standpoint, a structure can be considered as an array 
with elements of varying size. The elements of real arrays are always the 
same size and type. This property is what allows one to calculate the address 
of any element by knowing the starting address of the array, the size of the 
elements and the desired element’s index. 

A structure’s elements do not have to be the same size (and usually are 
not). Because of this each element of a structure must be explicitly specified 
and is given a tag (or name) instead of a numerical index. 

In assembly, the element of a structure will be accessed in a similar 
way as an element of an array. To access an element, one must know the 
starting address of the structure and the relative offset of that element from 
the beginning of the structure. However, unlike an array where this offset 
can be calculated by the index of the element, the element of a structure is 
assigned an offset by the compiler. 


¹ See the virtual memory management section of any operating systems
textbook for a discussion of this term.


  Offset   Element
    0      x
    2      y
    6      z

Figure 7.1: Structure S


  Offset   Element
    0      x
    2      unused
    4      y
    8      z

Figure 7.2: Structure S


For example, consider the following structure:


  struct S {
    short int x;    /* 2-byte integer */
    int       y;    /* 4-byte integer */
    double    z;    /* 8-byte float   */
  };


Figure 7.1 shows how a variable of type S might look in the computer's
memory. The ANSI C standard states that the elements of a structure are
arranged in memory in the same order as they are defined in the struct
definition. It also states that the first element is at the very beginning
of the structure (i.e., offset zero). It also defines a useful macro named
offsetof() in the stddef.h header file. This macro computes and returns
the offset of any element of a structure. The macro takes two parameters:
the first is the name of the type of the structure, the second is the name
of the element whose offset is wanted. Thus, the result of offsetof(S, y)
would be 2 from Figure 7.1.


  struct S {
    short int x;    /* 2-byte integer */
    int       y;    /* 4-byte integer */
    double    z;    /* 8-byte float   */
  } __attribute__((packed));

Figure 7.3: Packed struct using gcc


7.1.2 Memory alignment 


If one uses the offsetof macro to find the offset of y using the gcc
compiler, one will find that it returns 4, not 2! Why? Because gcc (and
many other compilers) aligns variables on double word boundaries by default.
(Recall that an address is on a double word boundary if it is divisible
by 4.) In 32-bit protected mode, the CPU reads memory faster if the data
starts at a double word boundary. Figure 7.2 shows how the S structure
really looks using gcc. The compiler inserts two unused bytes into the
structure to align y (and z) on a double word boundary. This shows why it
is a good idea to use offsetof to compute the offsets instead of
calculating them oneself when using structures defined in C.

Of course, if the structure is only used in assembly, the programmer
can determine the offsets himself. However, if one is interfacing C and
assembly, it is very important that both the assembly code and the C code
agree on the offsets of the elements of the structure! One complication is
that different C compilers may give different offsets to the elements. For
example, as we have seen, the gcc compiler creates an S structure that looks
like Figure 7.2; however, Borland's compiler would create a structure that
looks like Figure 7.1. C compilers provide ways to specify the alignment
used for data. However, the ANSI C standard does not specify how this will
be done and thus, different compilers do it differently.
The gcc compiler has a flexible and complicated method of specifying the 
alignment. The compiler allows one to specify the alignment of any type 
using a special syntax. For example, the following line: 


  typedef int unaligned_int __attribute__((aligned(1)));


defines a new type named unaligned_int that is aligned on byte boundaries.
(Yes, all the parentheses after __attribute__ are required!) The 1 in the
aligned parameter can be replaced with other powers of two to specify
other alignments (2 for word alignment, 4 for double word alignment, etc.).
If the y element of the structure was changed to be an unaligned_int type,
gcc would put y at offset 2. However, z would still be at offset 8 since
doubles are also double word aligned by default. The definition of z's type
would have to be changed as well for it to be put at offset 6.


  #pragma pack(push)    /* save alignment state */
  #pragma pack(1)       /* set byte alignment */

  struct S {
    short int x;    /* 2-byte integer */
    int       y;    /* 4-byte integer */
    double    z;    /* 8-byte float   */
  };

  #pragma pack(pop)     /* restore original alignment */

Figure 7.4: Packed struct using Microsoft or Borland


The gcc compiler also allows one to pack a structure. This tells the 
compiler to use the minimum possible space for the structure. Figure 7.3 
shows how S could be rewritten this way. This form of S would use the 
minimum bytes possible, 14 bytes. 

Microsoft’s and Borland’s compilers both support the same method of 
specifying alignment using a #pragma directive. 


  #pragma pack(1)


The directive above tells the compiler to pack elements of structures on 
byte boundaries (i.e., with no extra padding). The one can be replaced 
with two, four, eight or sixteen to specify alignment on word, double word, 
quad word and paragraph boundaries, respectively. The directive stays in 
effect until overridden by another directive. This can cause problems since 
these directives are often used in header files. If the header file is included 
before other header files with structures, these structures may be laid out 
differently than they would by default. This can lead to a very hard to find 
error. Different modules of a program might lay out the elements of the 
structures in different places! 

There is a way to avoid this problem. Microsoft and Borland support 
a way to save the current alignment state and restore it later. Figure 7.4 
shows how this would be done. 


7.1.3 Bit Fields 


Bit fields allow one to specify members of a struct that only use a
specified number of bits. The number of bits does not have to be a multiple
of eight. A bit field member is defined like an unsigned int or int member
with a colon and bit size appended to it. Figure 7.5 shows an example. This
defines a 32-bit variable that is decomposed into the following parts:

  8 bits | 11 bits | 10 bits | 3 bits
    f4   |   f3    |   f2    |   f1

The first bitfield is assigned to the least significant bits of its double
word.²


  struct S {
    unsigned f1 : 3;     /* 3-bit field  */
    unsigned f2 : 10;    /* 10-bit field */
    unsigned f3 : 11;    /* 11-bit field */
    unsigned f4 : 8;     /* 8-bit field  */
  };

Figure 7.5: Bit Field Example


  Byte \ Bit |  7    6    5  |  4    3    2    1    0
  -----------+---------------+------------------------
       0     |  Operation Code (08h)
       1     |  Logical Unit # |  msb of LBA
       2     |  middle of Logical Block Address
       3     |  lsb of Logical Block Address
       4     |  Transfer Length
       5     |  Control

Figure 7.6: SCSI Read Command Format


However, the format is not so simple if one looks at how the bits are
actually stored in memory. The difficulty occurs when bitfields span byte
boundaries, because the bytes on a little endian processor will be reversed
in memory. For example, the S struct bitfields will look like this in memory:

  5 bits | 3 bits || 3 bits | 5 bits || 8 bits || 8 bits
   f2l   |   f1   ||  f3l   |  f2m   ||  f3m   ||   f4

The f2l label refers to the last five bits (i.e., the five least significant
bits) of the f2 bit field. The f2m label refers to the five most significant
bits of f2. The double vertical lines show the byte boundaries. If one
reverses all the bytes, the pieces of the f2 and f3 fields will be reunited
in the correct place.

The physical memory layout is not usually important unless the data is
being transferred in or out of the program (which is actually quite common
with bit fields). It is common for hardware device interfaces to use odd
numbers of bits that bitfields could be useful to represent.


² Actually, the ANSI/ISO C standard gives the compiler some flexibility in
exactly how the bits are laid out. However, common C compilers (gcc,
Microsoft and Borland) will lay the fields out like this.


  #define MS_OR_BORLAND (defined(__BORLANDC__) \
                          || defined(_MSC_VER))

  #if MS_OR_BORLAND
  #  pragma pack(push)
  #  pragma pack(1)
  #endif

  struct SCSI_read_cmd {
    unsigned opcode : 8;
    unsigned lba_msb : 5;
    unsigned logical_unit : 3;
    unsigned lba_mid : 8;    /* middle bits */
    unsigned lba_lsb : 8;
    unsigned transfer_length : 8;
    unsigned control : 8;
  }
  #if defined(__GNUC__)
     __attribute__((packed))
  #endif
  ;

  #if MS_OR_BORLAND
  #  pragma pack(pop)
  #endif

Figure 7.7: SCSI Read Command Format Structure


One example is SCSI³. A direct read command for a SCSI device is specified
by sending a six byte message to the device in the format specified in
Figure 7.6. The difficulty in representing this using bitfields is the
logical block address, which spans 3 different bytes of the command. From
Figure 7.6, one sees that the data is stored in big endian format.
Figure 7.7 shows a definition that attempts to work with all compilers. The
first two lines define a macro that is true if the code is compiled with the
Microsoft or Borland compilers. The potentially confusing parts are lines 11
to 14 (the lba_msb through lba_lsb fields). First, one might wonder why the
lba_mid and lba_lsb fields are defined separately and not as a single 16-bit
field. The reason is that the data is in big endian order. A 16-bit field
would be stored in little endian order by the compiler. Next, the lba_msb
and logical_unit fields appear to be reversed; however, this is not the
case. They have to be put in this order. Figure 7.8 shows how the fields are
mapped as a 48-bit entity. (The byte boundaries are again denoted by the
double lines.) When this is stored in memory in little endian order, the
bits are arranged in the desired format (Figure 7.6).

³ Small Computer Systems Interface, an industry standard for hard disks, etc.


   8 bits  ||     8 bits      || 8 bits  || 8 bits  ||    3 bits    | 5 bits  || 8 bits
  control  || transfer_length || lba_lsb || lba_mid || logical_unit | lba_msb || opcode

Figure 7.8: Mapping of SCSI_read_cmd fields


  struct SCSI_read_cmd {
    unsigned char opcode;
    unsigned char lba_msb : 5;
    unsigned char logical_unit : 3;
    unsigned char lba_mid;    /* middle bits */
    unsigned char lba_lsb;
    unsigned char transfer_length;
    unsigned char control;
  }
  #if defined(__GNUC__)
     __attribute__((packed))
  #endif
  ;

Figure 7.9: Alternate SCSI Read Command Format Structure

To complicate matters more, the definition for the SCSI_read_cmd does
not quite work correctly for Microsoft C. If the sizeof(SCSI_read_cmd)
expression is evaluated, Microsoft C will return 8, not 6! This is because
the Microsoft compiler uses the type of the bitfield in determining how to
map the bits. Since all the bit fields are defined as unsigned types, the
compiler pads two bytes at the end of the structure to make it an integral
number of double words. This can be remedied by making all the fields
unsigned short instead. Now, the Microsoft compiler does not need to add any
pad bytes since six bytes is an integral number of two-byte words. The other
compilers also work correctly with this change. Figure 7.9 shows yet another
definition that works for all three compilers. It avoids all but two of the
bit fields by using unsigned char.

The reader should not be discouraged if he found the previous discussion
confusing. It is confusing! The author often finds it less confusing to avoid
bit fields altogether and use bit operations to examine and modify the bits
manually.

⁴ Mixing different types of bit fields leads to very confusing behavior! The
reader is invited to experiment.


7.1.4 Using structures in assembly 


As discussed above, accessing a structure in assembly is very much like 
accessing an array. For a simple example, consider how one would write 
an assembly routine that would zero out the y element of an S structure. 
Assuming the prototype of the routine would be: 


void zero_y( S * s_p ); 


the assembly routine would be: 


  %define y_offset  4
  _zero_y:
        enter   0,0
        mov     eax, [ebp + 8]      ; get s_p (struct pointer) from stack
        mov     dword [eax + y_offset], 0
        leave
        ret


C allows one to pass a structure by value to a function; however, this
is almost always a bad idea. When passed by value, the entire data in the
structure must be copied to the stack and then retrieved by the routine. It
is much more efficient to pass a pointer to a structure instead.

C also allows a structure type to be used as the return value of a
function. Obviously a structure cannot be returned in the EAX register.
Different compilers handle this situation differently. A common solution
that compilers use is to internally rewrite the function as one that takes a
structure pointer as a parameter. The pointer is used to put the return
value into a structure defined outside of the routine called.

Most assemblers (including NASM) have built-in support for defining 
structures in your assembly code. Consult your documentation for details. 


7.2 Assembly and C++ 


The C++ programming language is an extension of the C language. 
Many of the basic rules of interfacing C and assembly language also apply 
to C++. However, some rules need to be modified. Also, some of the 
extensions of C++ are easier to understand with a knowledge of assembly 
language. This section assumes a basic knowledge of C++. 


  #include <stdio.h>

  void f( int x )
  {
    printf("%d\n", x);
  }

  void f( double x )
  {
    printf("%g\n", x);
  }

Figure 7.10: Two f() functions


7.2.1 Overloading and Name Mangling 


C++ allows different functions (and class member functions) with the
same name to be defined. When more than one function shares the same
name, the functions are said to be overloaded. If two functions are defined
with the same name in C, the linker will produce an error because it will
find two definitions for the same symbol in the object files it is linking.
For example, consider the code in Figure 7.10. The equivalent assembly code
would define two labels named _f, which will obviously be an error.

C++ uses the same linking process as C, but avoids this error by
performing name mangling, or modifying the symbol used to label the
function. In a way, C already uses name mangling, too. It adds an underscore
to the name of the C function when creating the label for the function.
However, C will mangle the name of both functions in Figure 7.10 the same
way and produce an error. C++ uses a more sophisticated mangling process
that produces two different labels for the functions. For example, the first
function in Figure 7.10 would be assigned by DJGPP the label _f__Fi and the
second function, _f__Fd. This avoids any linker errors.

Unfortunately, there is no standard for how to mangle names in C++
and different compilers mangle names differently. For example, Borland
C++ would use the labels @f$qi and @f$qd for the two functions in
Figure 7.10. However, the rules are not completely arbitrary. The mangled
name encodes the signature of the function. The signature of a function is
defined by the order and the type of its parameters. Notice that the
function that takes a single int argument has an i at the end of its mangled
name (for both DJGPP and Borland) and that the one that takes a double
argument has a d at the end of its mangled name. If there was a function
named f with the prototype:

  void f( int x, int y, double z);

DJGPP would mangle its name to be _f__Fiid and Borland to @f$qiid.

The return type of the function is not part of a function’s signature and 
is not encoded in its mangled name. This fact explains a rule of overloading 
in C++. Only functions whose signatures are unique may be overloaded. As 
one can see, if two functions with the same name and signature are defined 
in C++, they will produce the same mangled name and will create a linker 
error. By default, all C++ functions are name mangled, even ones that are 
not overloaded. When it is compiling a file, the compiler has no way of 
knowing whether a particular function is overloaded or not, so it mangles 
all names. In fact, it also mangles the names of global variables by encoding 
the type of the variable in a similar way as function signatures. Thus, if one 
defines a global variable in one file as a certain type and then tries to use 
it in another file as the wrong type, a linker error will be produced. This 
characteristic of C++ is known as typesafe linking. It also exposes another 
type of error, inconsistent prototypes. This occurs when the definition of a 
function in one module does not agree with the prototype used by another 
module. In C, this can be a very difficult problem to debug. C does not 
catch this error. The program will compile and link, but will have undefined 
behavior as the calling code will be pushing different types on the stack than 
the function expects. In C++, it will produce a linker error. 


When the C++ compiler is parsing a function call, it looks for a matching
function by looking at the types of the arguments passed to the function⁵.
If it finds a match, it then creates a CALL to the correct function using the
compiler's name mangling rules.

Since different compilers use different name mangling rules, C++ code 
compiled by different compilers may not be able to be linked together. This 
fact is important when considering using a precompiled C++ library! If one 
wishes to write a function in assembly that will be used with C++ code, 
she must know the name mangling rules for the C++ compiler to be used 
(or use the technique explained below). 

The astute student may question whether the code in Figure 7.10 will
work as expected. Since C++ name mangles all functions, the printf
function will be mangled and the compiler will not produce a CALL to the
label _printf. This is a valid concern! If the prototype for printf was
simply placed at the top of the file, this would happen. The prototype is:

  int printf( const char *, ... );

DJGPP would mangle this to be _printf__FPCce. (The F is for function, P
for pointer, C for const, c for char and e for ellipsis.) This would not call
the regular C library's printf function! Of course, there must be a way for
C++ code to call C code. This is very important because there is a lot of
useful old C code around. In addition to allowing one to call legacy C code,
C++ also allows one to call assembly code using the normal C mangling
conventions.

⁵ The match does not have to be an exact match; the compiler will consider
matches made by casting the arguments. The rules for this process are beyond
the scope of this book. Consult a C++ book for details.

C++ extends the extern keyword to allow it to specify that the function
or global variable it modifies uses the normal C conventions. In C++
terminology, the function or global variable uses C linkage. For example, to
declare printf to have C linkage, use the prototype:

  extern "C" int printf( const char *, ... );

This instructs the compiler not to use the C++ name mangling rules on this
function, but instead to use the C rules. However, by doing this, the printf
function may not be overloaded. This provides the easiest way to interface
C++ and assembly: define the function to use C linkage and then use the C
calling convention.

For convenience, C++ also allows the linkage of a block of functions
and global variables to be defined. The block is denoted by the usual curly
braces.

  extern "C" {
    /* C linkage global variables and function prototypes */
  }


If one examines the ANSI C header files that come with C/C++ compilers
today, one will find the following near the top of each header file:

  #ifdef __cplusplus
  extern "C" {
  #endif

and a similar construction near the bottom containing a closing curly brace.
C++ compilers define the __cplusplus macro (with two leading underscores).
The snippet above encloses the entire header file within an extern "C"
block if the header file is compiled as C++, but does nothing if compiled
as C (since a C compiler would give a syntax error for extern "C"). This
same technique can be used by any programmer to create a header file for
assembly routines that can be used with either C or C++.


7.2.2 References 


References are another new feature of C++. They allow one to pass
parameters to functions without explicitly using pointers. For example,
consider the code in Figure 7.11. Actually, reference parameters are pretty
simple; they really are just pointers. The compiler just hides this from
the programmer (just as Pascal compilers implement var parameters as
pointers). When the compiler generates assembly for the function call on
line 7, it passes the address of y. If one was writing function f in
assembly, one would act as if the prototype was⁶:

  void f( int * xp );


  void f( int & x )         // the & denotes a reference parameter
  { x++; }

  int main()
  {
    int y = 5;
    f(y);                   // reference to y is passed, note no & here!
    printf("%d\n", y);      // prints out 6!
    return 0;
  }

Figure 7.11: Reference example


References are just a convenience that is especially useful for operator
overloading. This is another feature of C++ that allows one to define
meanings for common operators on structure or class types. For example, a
common use is to define the plus (+) operator to concatenate string objects.
Thus, if a and b were strings, a + b would return the concatenation of the
strings a and b. C++ would actually call a function to do this (in fact, this
expression could be rewritten in function notation as operator +(a,b)).
For efficiency, one would like to pass the address of the string objects
instead of passing them by value. Without references, this could be done as
operator +(&a,&b), but this would require one to write in operator syntax
as &a + &b. This would be very awkward and confusing. However, by using
references, one can write it as a + b, which looks very natural.


7.2.3 Inline functions 


Inline functions are yet another feature of C++⁷. Inline functions are
meant to replace the error-prone, preprocessor-based macros that take
parameters. Recall from C that writing a macro that squares a number might
look like:

  #define SQR(x) ((x)*(x))

Because the preprocessor does not understand C and does simple
substitutions, the parentheses are required to compute the correct answer in
most cases. However, even this version will not give the correct answer for
SQR(x++).

⁶ Of course, they might want to declare the function with C linkage to avoid
name mangling, as discussed in Section 7.2.1.
⁷ C compilers often support this feature as an extension of ANSI C.


  inline int inline_f( int x )
  { return x*x; }

  int f( int x )
  { return x*x; }

  int main()
  {
    int y, x = 5;
    y = f(x);
    y = inline_f(x);
    return 0;
  }

Figure 7.12: Inlining example

Macros are used because they eliminate the overhead of making a function
call for a simple function. As the chapter on subprograms demonstrated,
performing a function call involves several steps. For a very simple function,
the time it takes to make the function call may be more than the time to
actually perform the operations in the function! Inline functions are a much
more friendly way to write code that looks like a normal function, but that
does not CALL a common block of code. Instead, calls to inline functions are
replaced by code that performs the function. C++ allows a function to be
made inline by placing the keyword inline in front of the function
definition. For example, consider the functions declared in Figure 7.12. The
call to function f on line 10 does a normal function call (in assembly,
assuming x is at address ebp-8 and y is at ebp-4):


      push   dword [ebp-8]
      call   _f
      pop    ecx
      mov    [ebp-4], eax


However, the call to function inline_f on line 11 would look like:

      mov    eax, [ebp-8]
      imul   eax, eax
      mov    [ebp-4], eax

(Margin note to Section 7.2.4: Actually, C++ uses the this keyword to access
the pointer to the object acted on from inside the member function.)


In this case, there are two advantages to inlining. First, the inline func- 
tion is faster. No parameters are pushed on the stack, no stack frame is 
created and then destroyed, no branch is made. Secondly, the inline func- 
tion call uses less code! This last point is true for this example, but does 
not hold true in all cases. 

The main disadvantage of inlining is that inline code is not linked and 
so the code of an inline function must be available to all files that use it. 
The previous example assembly code shows this. The call of the non-inline 
function only requires knowledge of the parameters, the return value type, 
calling convention and the name of the label for the function. All this 
information is available from the prototype of the function. However, using
the inline function requires knowledge of all the code of the function.
This means that if any part of an inline function is changed, all source
files that use the function must be recompiled. Recall that for non-inline
functions, if the prototype does not change, often the files that use the
function need not be recompiled. For all these reasons, the code for inline
functions is usually placed in header files. This practice is contrary to the
normal hard and fast rule in C that executable code statements are never
placed in header files.


7.2.4 Classes 


A C++ class describes a type of object. An object has both data
members and function members [8]. In other words, it’s a struct with data and
functions associated with it. Consider the simple class defined in Figure 7.13.
A variable of Simple type would look just like a normal C struct with a 
single int member. The functions are not stored in memory assigned to the 
structure. However, member functions are different from other functions. 
They are passed a hidden parameter. This parameter is a pointer to the 
object that the member function is acting on. 

For example, consider the set_data method of the Simple class of Fig- 
ure 7.13. If it was written in C, it would look like a function that was 
explicitly passed a pointer to the object being acted on as the code in Fig- 
ure 7.14 shows. The -S switch on the DJGPP compiler (and the gcc and 
Borland compilers as well) tells the compiler to produce an assembly file 
containing the equivalent assembly language for the code produced. For 


[8] Often called member functions in C++ or, more generally, methods.

class Simple {
public:
  Simple();                // default constructor
  ~Simple();               // destructor
  int get_data() const;    // member functions
  void set_data( int );
private:
  int data;                // member data
};

Simple::Simple()
{ data = 0; }

Simple::~Simple()
{ /* null body */ }

int Simple::get_data() const
{ return data; }

void Simple::set_data( int x )
{ data = x; }

Figure 7.13: A simple C++ class 


DJGPP and gcc the assembly file ends in an .s extension and unfortunately
uses AT&T assembly language syntax, which is quite different from
NASM and MASM syntaxes [9]. (Borland and MS compilers generate a file
with a .asm extension using MASM syntax.) Figure 7.15 shows the output
of DJGPP converted to NASM syntax and with comments added to clarify
the purpose of the statements. On the very first line, note that the set_data
method is assigned a mangled label that encodes the name of the method,
the name of the class and the parameters. The name of the class is encoded
because other classes might have a method named set_data and the two
methods must be assigned different labels. The parameters are encoded so
that the class can overload the set_data method to take other parameters
just as normal C++ functions can. However, just as before, different compilers
will encode this information differently in the mangled label.


[9] The gcc compiler system includes its own assembler called gas. The gas assembler
uses AT&T syntax and thus the compiler outputs the code in the format for gas. There
are several pages on the web that discuss the differences in Intel and AT&T formats.
There is also a free program named a2i (http://www.multimania.com/placr/a2i.html)
that converts AT&T format to NASM format.

void set_data( Simple * object, int x )
{
  object->data = x;
}


Figure 7.14: C Version of Simple::set_data() 


_set_data__6Simplei:              ; mangled name
      push   ebp
      mov    ebp, esp

      mov    eax, [ebp + 8]       ; eax = pointer to object (this)
      mov    edx, [ebp + 12]      ; edx = integer parameter
      mov    [eax], edx           ; data is at offset 0
      leave
      ret


Figure 7.15: Compiler output of Simple::set_data( int ) 


Next on lines 2 and 3, the familiar function prologue appears. On line 5,
the first parameter on the stack is stored into EAX. This is not the x
parameter! Instead it is the hidden parameter [10] that points to the object being
acted on. Line 6 stores the x parameter into EDX and line 7 stores EDX into
the double word that EAX points to. This is the data member of the Simple
object being acted on, which, being the only data in the class, is stored at
offset 0 in the Simple structure. 
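
The hidden parameter can also be observed from C++ itself. The sketch below is only an illustration under stated assumptions: Simple_like is a hypothetical stand-in for the Simple class, with public data so that offsetof can be applied, and set_data_c mirrors the C version of Figure 7.14.

```cpp
#include <cstddef>

// Hypothetical stand-in for the Simple class (data is public here so that
// offsetof may be used on it).
struct Simple_like {
    int data;

    // Inside a member function, 'this' is the hidden pointer parameter.
    void set_data(int x) { this->data = x; }   // same effect as: data = x;

    // Returns the hidden pointer so that it can be compared with &object.
    const Simple_like* self() const { return this; }
};

// C-style equivalent: the object pointer is passed explicitly.
void set_data_c(Simple_like* object, int x) { object->data = x; }
```

For an object s of this type, s.self() is the same address as &s, and offsetof(Simple_like, data) is 0, which agrees with the store to [eax] in Figure 7.15.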


Example 


This section uses the ideas of the chapter to create a C++ class that
represents an unsigned integer of arbitrary size. Since the integer can be
any size, it will be stored in an array of unsigned integers (double words). It
can be made any size by using dynamic allocation. The double words are
stored in reverse order [11] (i.e. the least significant double word is at index
0). Figure 7.16 shows the definition of the Big_int class [12]. The size of a


[10] As usual, nothing is hidden in the assembly code!

[11] Why? Because addition operations will then always start processing at the beginning
of the array and move forward.

[12] See the code example source for the complete code for this example. The text will
only refer to some of the code.

class Big_int {
public:
  /*
   * Parameters:
   *   size          - size of integer expressed as number of
   *                   normal unsigned int's
   *   initial_value - initial value of Big_int as a normal unsigned int
   */
  explicit Big_int( size_t   size,
                    unsigned initial_value = 0 );
  /*
   * Parameters:
   *   size          - size of integer expressed as number of
   *                   normal unsigned int's
   *   initial_value - initial value of Big_int as a string holding
   *                   hexadecimal representation of value.
   */
  Big_int( size_t       size,
           const char * initial_value );

  Big_int( const Big_int & big_int_to_copy );
  ~Big_int();

  // returns size of Big_int (in terms of unsigned int's)
  size_t size() const;

  const Big_int & operator = ( const Big_int & big_int_to_copy );
  friend Big_int operator + ( const Big_int & op1,
                              const Big_int & op2 );
  friend Big_int operator - ( const Big_int & op1,
                              const Big_int & op2 );
  friend bool operator == ( const Big_int & op1,
                            const Big_int & op2 );
  friend bool operator < ( const Big_int & op1,
                           const Big_int & op2 );
  friend ostream & operator << ( ostream &       os,
                                 const Big_int & op );
private:
  size_t     size_;      // size of unsigned array
  unsigned * number_;    // pointer to unsigned array holding value
};


Figure 7.16: Definition of Big_int class

// prototypes for assembly routines
extern "C" {
  int add_big_ints( Big_int &       res,
                    const Big_int & op1,
                    const Big_int & op2 );
  int sub_big_ints( Big_int &       res,
                    const Big_int & op1,
                    const Big_int & op2 );
}

inline Big_int operator + ( const Big_int & op1, const Big_int & op2 )
{
  Big_int result( op1.size() );
  int res = add_big_ints( result, op1, op2 );
  if (res == 1)
    throw Big_int::Overflow();
  if (res == 2)
    throw Big_int::Size_mismatch();
  return result;
}

inline Big_int operator - ( const Big_int & op1, const Big_int & op2 )
{
  Big_int result( op1.size() );
  int res = sub_big_ints( result, op1, op2 );
  if (res == 1)
    throw Big_int::Overflow();
  if (res == 2)
    throw Big_int::Size_mismatch();
  return result;
}

Figure 7.17: Big_int Class Arithmetic Code

Big_int is measured by the size of the unsigned array that is used to store
its data. The size_ data member of the class is assigned offset zero and the
number_ member is assigned offset 4.

To simplify this example, only object instances with the same size
arrays can be added to or subtracted from each other.

The class has three constructors: the first (line 9) initializes the class 
instance by using a normal unsigned integer; the second (line 18) initializes 
the instance by using a string that contains a hexadecimal value. The third 
constructor (line 21) is the copy constructor. 
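
The first constructor in Figure 7.16 is declared explicit, which keeps the compiler from using it as an implicit conversion from an integer. The sketch below illustrates the effect with a hypothetical miniature class named Sized; it is not the Big_int code itself.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical miniature that copies only the shape of the first
// Big_int constructor declaration.
class Sized {
public:
    explicit Sized(std::size_t size, unsigned initial_value = 0)
        : size_(size), value_(initial_value) {}

    std::size_t size() const { return size_; }
    unsigned value() const { return value_; }

private:
    std::size_t size_;
    unsigned value_;
};

// An explicit constructor can still be called directly ...
static_assert(std::is_constructible<Sized, std::size_t>::value,
              "direct construction from a size is allowed");

// ... but it is never used for an implicit conversion.
static_assert(!std::is_convertible<std::size_t, Sized>::value,
              "no implicit conversion from an integer");
```

Writing Sized s(5, 1); compiles, while passing a bare 5 where a Sized is expected does not.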

This discussion focuses on how the addition and subtraction operators 
work since this is where the assembly language is used. Figure 7.17 shows 
the relevant parts of the header file for these operators. They show how the 
operators are set up to call the assembly routines. Since different compilers 
use radically different mangling rules for operator functions, inline operator 
functions are used to set up calls to C linkage assembly routines. This makes 
it relatively easy to port to different compilers and is just as fast as direct 
calls. This technique also eliminates the need to throw an exception from 
assembly! 

Why is assembly used at all here? Recall that to perform multiple
precision arithmetic, the carry must be moved from one dword to be added to
the next significant dword. C++ (and C) do not allow the programmer to
access the CPU’s carry flag. Performing the addition could only be done by
having C++ independently recalculate the carry flag and conditionally add
it to the next dword. It is much more efficient to write the code in assembly,
where the carry flag can be accessed directly. Using the ADC instruction,
which automatically adds the carry flag in, makes a lot of sense.

For brevity, only the add_big_ints assembly routine will be discussed
here. Below is the code for this routine (from big_math.asm):

big_math.asm

segment .text
        global  add_big_ints, sub_big_ints
%define size_offset 0
%define number_offset 4

%define EXIT_OK 0
%define EXIT_OVERFLOW 1
%define EXIT_SIZE_MISMATCH 2

; Parameters for both add and sub routines
%define res ebp+8
%define op1 ebp+12
%define op2 ebp+16

add_big_ints:
        push    ebp
        mov     ebp, esp
        push    ebx
        push    esi
        push    edi
        ;
        ; first set up esi to point to op1
        ;              edi to point to op2
        ;              ebx to point to res
        mov     esi, [op1]
        mov     edi, [op2]
        mov     ebx, [res]
        ;
        ; make sure that all 3 Big_int's have the same size
        ;
        mov     eax, [esi + size_offset]
        cmp     eax, [edi + size_offset]
        jne     sizes_not_equal                 ; op1.size_ != op2.size_
        cmp     eax, [ebx + size_offset]
        jne     sizes_not_equal                 ; op1.size_ != res.size_

        mov     ecx, eax                        ; ecx = size of Big_int's
        ;
        ; now, set registers to point to their respective arrays
        ;      esi = op1.number_
        ;      edi = op2.number_
        ;      ebx = res.number_
        ;
        mov     ebx, [ebx + number_offset]
        mov     esi, [esi + number_offset]
        mov     edi, [edi + number_offset]

        clc                                     ; clear carry flag
        xor     edx, edx                        ; edx = 0
        ;
        ; addition loop
add_loop:
        mov     eax, [edi+4*edx]
        adc     eax, [esi+4*edx]
        mov     [ebx + 4*edx], eax
        inc     edx                             ; does not alter carry flag
        loop    add_loop

        jc      overflow
ok_done:
        xor     eax, eax                        ; return value = EXIT_OK
        jmp     done
overflow:
        mov     eax, EXIT_OVERFLOW
        jmp     done
sizes_not_equal:
        mov     eax, EXIT_SIZE_MISMATCH
done:
        pop     edi
        pop     esi
        pop     ebx
        leave
        ret


big_math.asm 


Hopefully, most of this code should be straightforward to the reader by
now. Lines 25 to 27 store pointers to the Big_int objects passed to the
function into registers. Remember that references really are just pointers.
Lines 31 to 35 check to make sure that the sizes of the three objects’ arrays
are the same. (Note that the offset of size_ is added to the pointer to access
the data member.) Lines 44 to 46 adjust the registers to point to the array
used by the respective objects instead of the objects themselves. (Again,
the offset of the number_ member is added to the object pointer.)

The loop in lines 52 to 57 adds the integers stored in the arrays together
by adding the least significant dword first, then the next least significant
dwords, etc. The addition must be done in this sequence for extended
precision arithmetic (see Section 2.1.5). Line 59 checks for overflow; on overflow
the carry flag will be set by the last addition of the most significant dword.
Since the dwords in the array are stored in little endian order, the loop starts 
at the beginning of the array and moves forward toward the end. 
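
For comparison, here is a minimal sketch of what the same loop might look like in portable C++, where the carry has to be recomputed by hand. The function name add_dwords and its parameters are hypothetical, and the use of a 64-bit intermediate is just one way to recover the carry; this is not the approach taken by the book's code.

```cpp
#include <cstddef>
#include <cstdint>

// Adds two n-dword numbers stored least significant dword first.
// Returns the carry out of the most significant dword (1 on overflow).
unsigned add_dwords(std::uint32_t* res,
                    const std::uint32_t* op1,
                    const std::uint32_t* op2,
                    std::size_t n) {
    unsigned carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Widen to 64 bits so that the carry out of bit 31 is not lost.
        std::uint64_t sum = static_cast<std::uint64_t>(op1[i]) + op2[i] + carry;
        res[i] = static_cast<std::uint32_t>(sum);     // low 32 bits of the sum
        carry  = static_cast<unsigned>(sum >> 32);    // bit 32 is the carry
    }
    return carry;
}
```

Each pass does by hand the work that a single ADC instruction performs in the assembly version.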

Figure 7.18 shows a short example using the Big_int class. Note that
Big_int constants must be declared explicitly as on line 16. This is necessary
for two reasons. First, there is no conversion constructor that will convert
an unsigned int to a Big_int. Secondly, only Big_int’s of the same size can
be added. This makes conversion problematic since it would be difficult to
know what size to convert to. A more sophisticated implementation of the
class would allow any size to be added to any other size. The author did not
wish to overcomplicate this example by implementing this here. (However,
the reader is encouraged to do this.)

#include "big_int.hpp"
#include <iostream>
using namespace std;

int main()
{
  try {
    Big_int b(5,"8000000000000a00b");
    Big_int a(5,"80000000000010230");
    Big_int c = a + b;
    cout << a << " + " << b << " = " << c << endl;
    for( int i=0; i < 2; i++ ) {
      c = c + a;
      cout << "c = " << c << endl;
    }
    cout << "c-1 = " << c - Big_int(5,1) << endl;
    Big_int d(5, "12345678");
    cout << "d = " << d << endl;
    cout << "c == d " << (c == d) << endl;
    cout << "c > d " << (c > d) << endl;
  }
  catch( const char * str ) {
    cerr << "Caught: " << str << endl;
  }
  catch( Big_int::Overflow ) {
    cerr << "Overflow" << endl;
  }
  catch( Big_int::Size_mismatch ) {
    cerr << "Size mismatch" << endl;
  }
  return 0;
}

Figure 7.18: Simple Use of Big_int

#include <cstddef>
#include <iostream>
using namespace std;

class A {
public:
  void __cdecl m() { cout << "A::m()" << endl; }
  int ad;
};

class B : public A {
public:
  void __cdecl m() { cout << "B::m()" << endl; }
  int bd;
};

void f( A * p )
{
  p->ad = 5;
  p->m();
}

int main()
{
  A a;
  B b;
  cout << "Size of a: " << sizeof(a)
       << " Offset of ad: " << offsetof(A,ad) << endl;
  cout << "Size of b: " << sizeof(b)
       << " Offset of ad: " << offsetof(B,ad)
       << " Offset of bd: " << offsetof(B,bd) << endl;
  f(&a);
  f(&b);
  return 0;
}


Figure 7.19: Simple Inheritance

_f__FP1A:                         ; mangled function name
      push   ebp
      mov    ebp, esp
      mov    eax, [ebp+8]         ; eax points to object
      mov    dword [eax], 5       ; using offset 0 for ad
      mov    eax, [ebp+8]         ; passing address of object to A::m()
      push   eax
      call   _m__1A               ; mangled method name for A::m()
      add    esp, 4
      leave
      ret


Figure 7.20: Assembly Code for Simple Inheritance 


7.2.5 Inheritance and Polymorphism 


Inheritance allows one class to inherit the data and methods of another. 
For example, consider the code in Figure 7.19. It shows two classes, A and 
B, where class B inherits from A. The output of the program is: 


Size of a: 4 Offset of ad: 0
Size of b: 8 Offset of ad: 0 Offset of bd: 4
A::m()
A::m()


Notice that the ad data members of both classes (B inherits it from A) are 
at the same offset. This is important since the f function may be passed a 
pointer to either an A object or any object of a type derived (i.e. inherited 
from) A. Figure 7.20 shows the (edited) asm code for the function (generated 
by gcc). 

Note in the output that A’s m method was called for both the a and
b objects. From the assembly, one can see that the call to A::m() is
hard-coded into the function. For true object-oriented programming, the method
called should depend on what type of object is passed to the function. This
is known as polymorphism. C++ turns this feature off by default. One uses
the virtual keyword to enable it. Figure 7.21 shows how the two classes
would be changed. None of the other code needs to be changed.
Polymorphism can be implemented many ways. Unfortunately, gcc’s implementation
is in transition at the time of this writing and is becoming significantly more
complicated than its initial implementation. In the interest of simplifying
this discussion, the author will only cover the implementation of
polymorphism which the Windows-based Microsoft and Borland compilers use. This

class A {
public:
  virtual void __cdecl m() { cout << "A::m()" << endl; }
  int ad;
};

class B : public A {
public:
  virtual void __cdecl m() { cout << "B::m()" << endl; }
  int bd;
};


Figure 7.21: Polymorphic Inheritance 


implementation has not changed in many years and probably will not change
in the foreseeable future.

With these changes, the output of the program changes:

Size of a: 8 Offset of ad: 4
Size of b: 12 Offset of ad: 4 Offset of bd: 8
A::m()
B::m()

Now the second call to f calls the B::m() method because it is passed
a B object. This is not the only change however. The size of an A is now 8
(and B is 12). Also, the offset of ad is 4, not 0. What is at offset 0? The
answers to these questions are related to how polymorphism is implemented.

A C++ class that has any virtual methods is given an extra hidden field
that is a pointer to an array of method pointers [13]. This table is often called
the vtable. For the A and B classes this pointer is stored at offset 0. The
Windows compilers always put this pointer at the beginning of the class at
the top of the inheritance tree. Looking at the assembly code (Figure 7.22)
generated for function f (from Figure 7.19) for the virtual method version
of the program, one can see that the call to method m is not to a label.
Line 9 finds the address of the vtable from the object. The address of the
object is pushed on the stack in line 11. Line 12 calls the virtual method by
branching to the first address in the vtable [14]. This call does not use a label;
it branches to the code address pointed to by EDX. This type of call is an

[13] For classes without virtual methods C++ compilers always make the class compatible
with a normal C struct with the same data members.

[14] Of course, this value is already in the ECX register. It was put there in line 8 and
line 10 could be removed and the next line changed to push ECX. The code is not very
efficient because it was generated without compiler optimizations turned on.

?f@@YAXPAVA@@@Z:
      push   ebp
      mov    ebp, esp

      mov    eax, [ebp+8]
      mov    dword [eax+4], 5     ; p->ad = 5;

      mov    ecx, [ebp + 8]       ; ecx = p
      mov    edx, [ecx]           ; edx = pointer to vtable
      mov    eax, [ebp + 8]       ; eax = p
      push   eax                  ; push "this" pointer
      call   dword [edx]          ; call first function in vtable
      add    esp, 4               ; clean up stack

      pop    ebp
      ret

Figure 7.22: Assembly Code for f() Function


example of late binding. Late binding delays the decision of which method 
to call until the code is running. This allows the code to call the appropriate 
method for the object. The normal case (Figure 7.20) hard-codes a call to a 
certain method and is called early binding (since here the method is bound 
early, at compile time). 
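
The difference between the two kinds of binding can be sketched in a few lines of C++. The types Base and Derived and the two helper functions below are hypothetical names chosen for this illustration; they return strings instead of printing so that the result is easy to check.

```cpp
#include <string>

struct Base {
    // Non-virtual: the call is bound at compile time (early binding).
    std::string plain() const { return "Base"; }

    // Virtual: the call goes through the vtable at run time (late binding).
    virtual std::string dynamic() const { return "Base"; }

    virtual ~Base() {}
};

struct Derived : Base {
    std::string plain() const { return "Derived"; }            // hides Base::plain
    std::string dynamic() const override { return "Derived"; } // overrides
};

// Both helpers only know the static type Base.
std::string call_plain(const Base* p)   { return p->plain(); }
std::string call_dynamic(const Base* p) { return p->dynamic(); }
```

Passing the address of a Derived object to call_plain still yields "Base", while call_dynamic yields "Derived", mirroring the two program outputs shown earlier.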

The attentive reader will be wondering why the class methods in
Figure 7.21 are explicitly declared to use the C calling convention by using
the __cdecl keyword. By default, Microsoft uses a different calling
convention for C++ class methods than the standard C convention. It passes the
pointer to the object acted on by the method in the ECX register instead
of using the stack. The stack is still used for the other explicit parameters
of the method. The __cdecl modifier tells it to use the standard C calling
convention. Borland C++ uses the C calling convention by default.

Next let’s look at a slightly more complicated example (Figure 7.23).
In it, the classes A and B each have two methods: m1 and m2. Remember
that since class B does not define its own m2 method, it inherits the A class’s
method. Figure 7.24 shows how the b object appears in memory. Figure 7.25
shows the output of the program. First, look at the address of the vtable
for each object. The two B objects’ vtable addresses are the same and thus, they
share the same vtable. A vtable is a property of the class, not an object (like
a static data member). Next, look at the addresses in the vtables. From
looking at assembly output, one can determine that the m1 method pointer

class A {
public:
  virtual void __cdecl m1() { cout << "A::m1()" << endl; }
  virtual void __cdecl m2() { cout << "A::m2()" << endl; }
  int ad;
};

class B : public A {    // B inherits A's m2()
public:
  virtual void __cdecl m1() { cout << "B::m1()" << endl; }
  int bd;
};
/* prints the vtable of given object */
void print_vtable( A * pa )
{
  // p sees pa as an array of dwords
  unsigned * p = reinterpret_cast<unsigned *>(pa);
  // vt sees vtable as an array of pointers
  void ** vt = reinterpret_cast<void **>(p[0]);
  cout << hex << "vtable address = " << vt << endl;
  for( int i=0; i < 2; i++ )
    cout << "dword " << i << ": " << vt[i] << endl;

  // call virtual functions in EXTREMELY non-portable way!
  void (*m1func_pointer)(A *);    // function pointer variable
  m1func_pointer = reinterpret_cast<void (*)(A*)>(vt[0]);
  m1func_pointer(pa);             // call method m1 via function pointer

  void (*m2func_pointer)(A *);    // function pointer variable
  m2func_pointer = reinterpret_cast<void (*)(A*)>(vt[1]);
  m2func_pointer(pa);             // call method m2 via function pointer
}

int main()
{
  A a; B b1; B b2;
  cout << "a: " << endl; print_vtable(&a);
  cout << "b1: " << endl; print_vtable(&b1);
  cout << "b2: " << endl; print_vtable(&b2);
  return 0;
}


Figure 7.23: More complicated example

          b1                        vtable
   0 | vtablep | ------->    0 | &B::m1() |
   4 | ad      |             4 | &A::m2() |
   8 | bd      |


Figure 7.24: Internal representation of b1 


a:
vtable address = 004120E8
dword 0: 00401320
dword 1: 00401350
A::m1()
A::m2()
b1:
vtable address = 004120F0
dword 0: 004013A0
dword 1: 00401350
B::m1()
A::m2()
b2:
vtable address = 004120F0
dword 0: 004013A0
dword 1: 00401350
B::m1()
A::m2()


Figure 7.25: Output of program in Figure 7.23

is at offset 0 (or dword 0) and m2 is at offset 4 (dword 1). The m2 method
pointers are the same for the A and B class vtables because class B inherits
the m2 method from the A class.

Lines 25 to 32 show how one could call a virtual function by reading its
address out of the vtable for the object [15]. The method address is stored into
a C-type function pointer with an explicit this pointer. From the output in
Figure 7.25, one can see that it does work. However, please do not write
code like this! This is only used to illustrate how the virtual methods use
the vtable.

There are some practical lessons to learn from this. One important fact 
is that one would have to be very careful when reading and writing class 
variables to a binary file. One can not just use a binary read or write on 
the entire object as this would read or write out the vtable pointer to the 
file! This is a pointer to where the vtable resides in the program’s memory 
and will vary from program to program. This same problem can occur in C 
with structs, but in C, structs only have pointers in them if the programmer 
explicitly puts them in. There are no obvious pointers defined in either the 
A or B classes. 
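
One way to guard against this mistake in newer C++ (C++11 and later, which postdates the compilers discussed here) is to ask the compiler whether a type may be copied byte by byte. The sketch below is an illustration only: Plain, WithVirtual and raw_copy_allowed are hypothetical names. The check catches the hidden vtable pointer, but it does not catch pointers that the programmer declares explicitly.

```cpp
#include <type_traits>

struct Plain {
    int ad;
};

struct WithVirtual {
    virtual void m() {}   // forces a hidden vtable pointer into each object
    int ad;
};

// A type with virtual methods is never trivially copyable, so this trait
// rejects exactly the classes that carry a hidden vtable pointer.
static_assert(std::is_trivially_copyable<Plain>::value,
              "Plain may be written as raw bytes");
static_assert(!std::is_trivially_copyable<WithVirtual>::value,
              "WithVirtual must not be written as raw bytes");

// Hypothetical guard that could be checked before a binary read or write.
template <typename T>
constexpr bool raw_copy_allowed() {
    return std::is_trivially_copyable<T>::value;
}
```

On the common compilers the object with the virtual method is also larger than the plain one, for the same reason the sizes changed in the program output above.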

Again, it is important to realize that different compilers implement
virtual methods differently. In Windows, COM (Component Object Model)
class objects use vtables to implement COM interfaces [16]. Only compilers
that implement virtual method vtables as Microsoft does can create COM
classes. This is why Borland uses the same implementation as Microsoft and
one of the reasons why gcc cannot be used to create COM classes.

The code for the virtual method looks exactly like a non-virtual one.
Only the code that calls it is different. If the compiler can be absolutely
sure of what virtual method will be called, it can ignore the vtable and call
the method directly (i.e., use early binding).


7.2.6 Other C++ features 


The workings of other C++ features (e.g., RunTime Type Information, 
exception handling and multiple inheritance) are beyond the scope of this 
text. If the reader wishes to go further, a good starting point is The Anno- 
tated C++ Reference Manual by Ellis and Stroustrup and The Design and 
Evolution of C++ by Stroustrup. 


[15] Remember this code only works with the MS and Borland compilers, not gcc.
[16] COM classes also use the __stdcall calling convention, not the standard C one.

Appendix A


80x86 Instructions 


A.1 Non-floating Point Instructions 


This section lists and describes the actions and formats of the non- 
floating point instructions of the Intel 80x86 CPU family. 
The formats use the following abbreviations: 


R     general register
R8    8-bit register
R16   16-bit register
R32   32-bit register
SR    segment register
M     memory
M8    byte
M16   word
M32   double word
I     immediate value

These can be combined for the multiple operand instructions. For example,
the format R, R means that the instruction takes two register operands.
Many of the two operand instructions allow the same operands. The
abbreviation O2 is used to represent these operands: R,R R,M R,I M,R M,I. If
an 8-bit register or memory can be used for an operand, the abbreviation
R/M8 is used.

The table also shows how various bits of the FLAGS register are affected
by each instruction. If the column is blank, the corresponding bit is not
affected at all. If the bit is always changed to a particular value, a 1 or 0 is
shown in the column. If the bit is changed to a value that depends on the
operands of the instruction, a C is placed in the column. Finally, if the bit
is modified in some undefined way a ? appears in the column. Because the
only instructions that change the direction flag are CLD and STD, it is not
listed under the FLAGS columns.


                                                              Flags
Name           Description                       Formats      O S Z A P C
ADC            Add with Carry                    O2           C C C C C C
ADD            Add Integers                      O2           C C C C C C
AND            Bitwise AND                       O2           0 C C ? C 0
BSWAP          Byte Swap                         R32
CALL           Call Routine                      R M I
CBW            Convert Byte to Word
CDQ            Convert Dword to Qword
CLC            Clear Carry                                              0
CLD            Clear Direction Flag
CMC            Complement Carry                                         C
CMP            Compare Integers                  O2           C C C C C C
CMPSB          Compare Bytes                                  C C C C C C
CMPSW          Compare Words                                  C C C C C C
CMPSD          Compare Dwords                                 C C C C C C
CWD            Convert Word to Dword into DX:AX
CWDE           Convert Word to Dword into EAX
DEC            Decrement Integer                 R M          C C C C C
DIV            Unsigned Divide                   R M          ? ? ? ? ? ?
ENTER          Make stack frame                  I,0
IDIV           Signed Divide                     R M          ? ? ? ? ? ?
IMUL           Signed Multiply                   R M          C ? ? ? ? C
                                                 R16,R/M16
                                                 R32,R/M32
                                                 R16,I
                                                 R32,I
                                                 R16,R/M16,I
                                                 R32,R/M32,I
INC            Increment Integer                 R M          C C C C C
INT            Generate Interrupt                I
JA             Jump Above                        I
JAE            Jump Above or Equal               I
JB             Jump Below                        I
JBE            Jump Below or Equal               I
JC             Jump Carry                        I
JCXZ           Jump if CX = 0                    I
JE             Jump Equal                        I
JG             Jump Greater                      I
JGE            Jump Greater or Equal             I
JL             Jump Less                         I
JLE            Jump Less or Equal                I
JMP            Unconditional Jump                R M I
JNA            Jump Not Above                    I
JNAE           Jump Not Above or Equal           I
JNB            Jump Not Below                    I
JNBE           Jump Not Below or Equal           I
JNC            Jump No Carry                     I
JNE            Jump Not Equal                    I
JNG            Jump Not Greater                  I
JNGE           Jump Not Greater or Equal         I
JNL            Jump Not Less                     I
JNLE           Jump Not Less or Equal            I
JNO            Jump No Overflow                  I
JNS            Jump No Sign                      I
JNZ            Jump Not Zero                     I
JO             Jump Overflow                     I
JPE            Jump Parity Even                  I
JPO            Jump Parity Odd                   I
JS             Jump Sign                         I
JZ             Jump Zero                         I
LAHF           Load FLAGS into AH
LEA            Load Effective Address            R32,M
LEAVE          Leave Stack Frame
LODSB          Load Byte
LODSW          Load Word
LODSD          Load Dword
LOOP           Loop                              I
LOOPE/LOOPZ    Loop If Equal                     I
LOOPNE/LOOPNZ  Loop If Not Equal                 I
MOV            Move Data                         O2
                                                 SR,R/M16
                                                 R/M16,SR
MOVSB          Move Byte
MOVSW          Move Word
MOVSD          Move Dword
MOVSX          Move Signed                       R16,R/M8
                                                 R32,R/M8
                                                 R32,R/M16
MOVZX          Move Unsigned                     R16,R/M8
                                                 R32,R/M8
                                                 R32,R/M16
MUL            Unsigned Multiply                 R M          C ? ? ? ? C
NEG            Negate                            R M          C C C C C C
NOP            No Operation
NOT            1's Complement                    R M
OR             Bitwise OR                        O2           0 C C ? C 0
POP            Pop From Stack                    R/M16
                                                 R/M32
POPA           Pop All
POPF           Pop FLAGS                                      C C C C C C
PUSH           Push to Stack                     R/M16
                                                 R/M32 I
PUSHA          Push All
PUSHF          Push FLAGS
RCL            Rotate Left with Carry            R/M,I        C         C
                                                 R/M,CL
RCR            Rotate Right with Carry           R/M,I        C         C
                                                 R/M,CL
REP            Repeat
REPE/REPZ      Repeat If Equal
REPNE/REPNZ    Repeat If Not Equal
RET            Return
ROL            Rotate Left                       R/M,I        C         C
                                                 R/M,CL
ROR            Rotate Right                      R/M,I        C         C
                                                 R/M,CL
SAHF           Copies AH into FLAGS                             C C C C C
SAL            Shifts to Left                    R/M,I                  C
                                                 R/M,CL
SBB            Subtract with Borrow              O2           C C C C C C
SCASB          Scan for Byte                                  C C C C C C
SCASW          Scan for Word                                  C C C C C C
SCASD          Scan for Dword                                 C C C C C C
SETA           Set Above                         R/M8
SETAE          Set Above or Equal                R/M8
SETB           Set Below                         R/M8
SETBE          Set Below or Equal                R/M8
SETC           Set Carry                         R/M8
SETE           Set Equal                         R/M8
SETG           Set Greater                       R/M8
SETGE          Set Greater or Equal              R/M8
SETL           Set Less                          R/M8
SETLE          Set Less or Equal                 R/M8
SETNA          Set Not Above                     R/M8
SETNAE         Set Not Above or Equal            R/M8
SETNB          Set Not Below                     R/M8
SETNBE         Set Not Below or Equal            R/M8
SETNC          Set No Carry                      R/M8
SETNE          Set Not Equal                     R/M8
SETNG          Set Not Greater                   R/M8
SETNGE         Set Not Greater or Equal          R/M8
SETNL          Set Not Less                      R/M8
SETNLE         Set Not Less or Equal             R/M8
SETNO          Set No Overflow                   R/M8
SETNS          Set No Sign                       R/M8
SETNZ          Set Not Zero                      R/M8
SETO           Set Overflow                      R/M8
SETPE          Set Parity Even                   R/M8
SETPO          Set Parity Odd                    R/M8
SETS           Set Sign                          R/M8
SETZ           Set Zero                          R/M8
SAR            Arithmetic Shift to Right         R/M,I                  C
                                                 R/M,CL
SHR            Logical Shift to Right            R/M,I                  C
                                                 R/M,CL
SHL            Logical Shift to Left             R/M,I                  C
                                                 R/M,CL
STC            Set Carry                                                1
STD            Set Direction Flag
STOSB          Store Byte
STOSW          Store Word
STOSD          Store Dword
SUB            Subtract                          O2           C C C C C C
TEST           Logical Compare                   R/M,R        0 C C ? C 0
                                                 R/M,I
XCHG           Exchange                          R/M,R
                                                 R,R/M
XOR            Bitwise XOR                       O2           0 C C ? C 0




A.2 Floating Point Instructions 


In this section, many of the 80x86 math coprocessor instructions are 
described. The description section briefly describes the operation of the 
instruction. To save space, information about whether the instruction pops 
the stack is not given in the description. 


The format column shows what type of operands can be used with each 
instruction. The following abbreviations are used: 


STn    A coprocessor register
F      Single precision number in memory
D      Double precision number in memory
E      Extended precision number in memory
I16    Integer word in memory
I32    Integer double word in memory
I64    Integer quad word in memory


Instructions requiring a Pentium Pro or better are marked with an 
asterisk (*). 

Instruction         Description                   Format
FABS                ST0 = |ST0|
FADD src            ST0 += src                    STn F D
FADD dest, ST0      dest += ST0                   STn
FADDP dest [,ST0]   dest += ST0                   STn
FCHS                ST0 = -ST0
FCOM src            Compare ST0 and src           STn F D
FCOMP src           Compare ST0 and src           STn F D
FCOMPP              Compares ST0 and ST1
FCOMI* src          Compares into FLAGS           STn
FCOMIP* src         Compares into FLAGS           STn
FDIV src            ST0 /= src                    STn F D
FDIV dest, ST0      dest /= ST0                   STn
FDIVP dest [,ST0]   dest /= ST0                   STn
FDIVR src           ST0 = src/ST0                 STn F D
FDIVR dest, ST0     dest = ST0/dest               STn
FDIVRP dest [,ST0]  dest = ST0/dest               STn
FFREE dest          Marks as empty                STn
FIADD src           ST0 += src                    I16 I32
FICOM src           Compare ST0 and src           I16 I32
FICOMP src          Compare ST0 and src           I16 I32
FIDIV src           ST0 /= src                    I16 I32
FIDIVR src          ST0 = src/ST0                 I16 I32
FILD src            Push src on Stack             I16 I32 I64
FIMUL src           ST0 *= src                    I16 I32
FINIT               Initialize Coprocessor
FIST dest           Store ST0                     I16 I32
FISTP dest          Store ST0                     I16 I32 I64
FISUB src           ST0 -= src                    I16 I32
FISUBR src          ST0 = src - ST0               I16 I32
FLD src             Push src on Stack             STn F D E
FLD1                Push 1.0 on Stack
FLDCW src           Load Control Word Register    I16
FLDPI               Push π on Stack
FLDZ                Push 0.0 on Stack
FMUL src            ST0 *= src                    STn F D
FMUL dest, ST0      dest *= ST0                   STn
FMULP dest [,ST0]   dest *= ST0                   STn
FRNDINT             Round ST0
FSCALE              ST0 = ST0 × 2^⌊ST1⌋
FSQRT               ST0 = √ST0
FST dest            Store ST0                     STn F D
FSTP dest           Store ST0                     STn F D E
FSTCW dest          Store Control Word Register   I16
FSTSW dest          Store Status Word Register    I16 AX
FSUB src            ST0 -= src                    STn F D
FSUB dest, ST0      dest -= ST0                   STn
FSUBP dest [,ST0]   dest -= ST0                   STn
FSUBR src           ST0 = src - ST0               STn F D
FSUBR dest, ST0     dest = ST0 - dest             STn
FSUBRP dest [,ST0]  dest = ST0 - dest             STn
FTST                Compare ST0 with 0.0
FXCH dest           Exchange ST0 and dest         STn
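
Since the descriptions above do not say which instructions pop the stack, it may help to see the difference modeled. As a rule of thumb, the instructions whose names end in P (FADDP, FSTP, FCOMP and so on) pop once after operating, and FCOMPP pops twice. The following C sketch is a toy model only: the type fpu_model and the functions named after the instructions are hypothetical, and the real coprocessor tracks its stack with a top-of-stack field in the status word rather than a depth counter.

```c
#include <assert.h>

/* Toy model of the coprocessor register stack (illustration only). */
typedef struct {
    double reg[8];  /* the eight coprocessor registers        */
    int depth;      /* how many of them currently hold values */
} fpu_model;

/* FINIT: start with an empty stack. */
void finit(fpu_model *f) { f->depth = 0; }

/* STn is counted from the top of the stack; ST0 is the newest value. */
double *st(fpu_model *f, int n)
{
    assert(n >= 0 && n < f->depth);
    return &f->reg[f->depth - 1 - n];
}

/* FLD src: push src on the stack. */
void fld(fpu_model *f, double src)
{
    assert(f->depth < 8);
    f->reg[f->depth++] = src;
}

/* FADD src: ST0 += src; the stack depth is unchanged. */
void fadd(fpu_model *f, double src) { *st(f, 0) += src; }

/* FADDP STn: STn += ST0, then pop ST0 off the stack. */
void faddp(fpu_model *f, int n)
{
    *st(f, n) += *st(f, 0);
    f->depth--;
}

/* FST dest: store ST0 without popping. */
double fst(fpu_model *f) { return *st(f, 0); }

/* FSTP dest: store ST0, then pop it. */
double fstp(fpu_model *f)
{
    double value = *st(f, 0);
    f->depth--;
    return value;
}
```

In this model, pushing 1.5 and 2.25 and then calling faddp(&f, 1) leaves a single value, 3.75, in ST0. Calling fadd(&f, 0.25) afterwards changes that value to 4.0 without changing the depth, and a final fstp(&f) returns 4.0 and leaves the stack empty.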


